Posted to dev@cassandra.apache.org by "Miklosovic, Stefan" <St...@netapp.com> on 2023/02/03 09:09:45 UTC

Implicitly enabling ALLOW FILTERING on virtual tables

Hi list,

the content of virtual tables is held in memory (and / or is fetched every time upon request). When querying such a table on a column outside of the primary key, users are normally required to specify ALLOW FILTERING. This makes total sense for "ordinary tables", where applications need performant and efficient queries, but it kind of loses its applicability for virtual tables, which literally hold just a handful of entries in memory, so it just does not matter, does it?

What do you think about implicitly allowing filtering for virtual tables, so we save ourselves from these pesky errors when we want to query an arbitrary column and need to satisfy the CQL spec just to do that?

It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.

We can also explicitly document this behavior.

Among other options, we may try to implement secondary indices on virtual tables, but I am not completely sure this is what we want because of the complexity etc. Is it even necessary to put such complex logic in place just to be able to select on any column over a few entries in memory?

I put together a draft here (1). It would only ever be possible to implicitly allow filtering on virtual tables, and it would be the implementor's responsibility to decide that, per table.

For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.

(1) https://github.com/apache/cassandra/pull/2131
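
To make the proposal concrete, here is a minimal Python sketch of the decision it describes. It is a toy model, not the code from the draft PR: the `Table` class, the `allows_filtering_implicitly` flag, and `check_filtering` are hypothetical names, and the rule "any restricted column outside the primary key needs filtering" is a deliberate simplification of the real CQL rules (which also depend on partition key restrictions, clustering order, and indexes).

```python
from dataclasses import dataclass


class InvalidRequest(Exception):
    """Raised when a query needs filtering but is not permitted to filter."""


@dataclass(frozen=True)
class Table:
    name: str
    primary_key: frozenset
    is_virtual: bool = False
    # Hypothetical per-table flag, set by whoever implements the virtual table.
    allows_filtering_implicitly: bool = False


def check_filtering(table, restricted_columns, allow_filtering=False):
    """Return True if the query filters, False if it does not need to.

    Raises InvalidRequest when filtering is needed but neither requested
    explicitly nor implicitly allowed for this (virtual) table.
    """
    # Simplified rule: restricting a non-primary-key column needs filtering.
    needs_filtering = any(c not in table.primary_key for c in restricted_columns)
    if not needs_filtering:
        return False
    # The implicit path only ever applies to virtual tables.
    implicit = table.is_virtual and table.allows_filtering_implicitly
    if allow_filtering or implicit:
        return True
    raise InvalidRequest(
        f"query on {table.name} requires ALLOW FILTERING (illustrative message)"
    )


# Illustrative tables; the column names are assumptions for the example.
clients = Table("clients", frozenset({"address", "port"}),
                is_virtual=True, allows_filtering_implicitly=True)
users = Table("users", frozenset({"id"}))

check_filtering(clients, ["username"])                   # allowed implicitly
check_filtering(users, ["email"], allow_filtering=True)  # allowed explicitly
```

In this model a regular table can never opt in, which mirrors the "virtual tables only" restriction, and a virtual table whose implementor leaves the flag off still gets the usual error.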

Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by "Miklosovic, Stefan" <St...@netapp.com>.
While that might technically work, Benedict, I am afraid that if we enable users to have this kind of power, they would start to set ALLOW FILTERING here and there in order to not think twice about their data model so they can just call it a day.

At the same time, we have a guardrail for allowing filtering. If we set a table to be allowed to be filtered on and we would have a guardrail to forbid it, which approach would be applied?

________________________________________
From: Benedict <be...@apache.org>
Sent: Friday, February 3, 2023 22:13
To: dev@cassandra.apache.org
Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables


Why not introduce a general table option that toggles ALLOW FILTERING behaviour and just flip it for virtual tables we want this behaviour for? Users can do it too, for their own tables for which it’s suitable.

On 3 Feb 2023, at 20:59, Andrés de la Peña <ad...@apache.org> wrote:


For those eventual big virtual tables there is the mentioned flag indicating whether the table allows filtering without AF.

I guess the question is how can a user know whether a certain virtual table is one of the big ones. That could be specified in the doc for each table, and it could also be included in the table properties, so it's displayed by DESCRIBE TABLE queries.

On Fri, 3 Feb 2023 at 20:56, Chris Lohfink <cl...@gmail.com> wrote:
Just to second what Scott says. While everything is in memory now, it may not be in the future, and if we add it implicitly, we are tying ourselves to being in memory only. However, I wouldn't -1 the idea.

Another option may be a cqlsh option (i.e. like expand on/off) to always include the flag so it doesn't need to be added, or something.

Chris

On Fri, Feb 3, 2023 at 1:24 PM C. Scott Andreas <sc...@paradoxica.net> wrote:
There are some ideas that development community members have kicked around that may falsify the assumption that "virtual tables are tiny and will fit in memory."

One example is CASSANDRA-14629: Abstract Virtual Table for very large result sets
https://issues.apache.org/jira/browse/CASSANDRA-14629

Chris's proposal here is to enable query results from virtual tables to be streamed to the client rather than being fully materialized. There are some neat possibilities suggested in this ticket, such as debug functionality to dump the contents of a raw SSTable via the CQL interface, or the contents of the database's internal caches. One could also imagine a feature like this providing functionality similar to a foreign data wrapper in other databases.

I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

I don't think we should implicitly add "ALLOW FILTERING" to all queries against virtual tables because of this, in addition to concern with departing from standard CQL semantics for a type of tables deemed special.

– Scott

On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mm...@apache.org> wrote:


Hello Stefan,

Regarding the decision to implicitly enable ALLOW FILTERING for
virtual tables, which also makes sense to me, it may be necessary to
consider changing the clustering columns in the virtual table metadata
to regular columns as well. The reasons are the same as mentioned
earlier: the virtual tables hold their data in memory, thus we do not
benefit from the advantages of ordered data (e.g. the ClientsTable and
its ClusteringColumn(PORT)).

Changing the clustering column to a regular column may simplify the
virtual table data model, but I'm afraid it may affect users who rely
on the table metadata.



On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org> wrote:

I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.

That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.

> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.


It might even be quite the opposite, since in the current situation users might get used to routinely using ALLOW FILTERING for querying their virtual tables.

It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simple doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and the doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.

(1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329




Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Benedict <be...@apache.org>.
Why not introduce a general table option that toggles ALLOW FILTERING
behaviour and just flip it for virtual tables we want this behaviour for?
Users can do it too, for their own tables for which it’s suitable.

  


Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Chris Lohfink <cl...@gmail.com>.
Yes, I am not -1. Just that, if we do it, we should be OK with some future
virtual tables not having this behavior. We should consider whether this
would be confusing. It really should be OK IMHO, since users would just get
the "need allow filtering" error on said future tables.

Chris


Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Andrés de la Peña <ad...@apache.org>.
For those eventual big virtual tables there is the mentioned flag
indicating whether the table allows filtering without AF.

I guess the question is how can a user know whether a certain virtual table
is one of the big ones. That could be specified in the doc for each table,
and it could also be included in the table properties, so it's displayed by
DESCRIBE TABLE queries.


Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Chris Lohfink <cl...@gmail.com>.
Just to second what Scott says. While everything is in memory now, it may not
be in the future, and if we add it implicitly, we are tying ourselves to
being in memory only. However, I wouldn't -1 the idea.

Another option may be a cqlsh option (i.e. like expand on/off) to always
include the flag so it doesn't need to be added, or something.

Chris
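
The client-side alternative can be sketched roughly as follows. This is not cqlsh code and no such cqlsh option exists in the thread; `with_allow_filtering` and its `enabled` toggle are hypothetical, and the regular expression is a naive illustration that only recognizes keyspace-qualified table names. The only facts assumed are that Cassandra's built-in virtual tables live in the `system_views` and `system_virtual_schema` keyspaces.

```python
import re

# Keyspaces that hold Cassandra's built-in virtual tables.
VIRTUAL_KEYSPACES = ("system_views", "system_virtual_schema")

# Naive match for "FROM <keyspace>."; unqualified table names are not handled.
_FROM_RE = re.compile(r"\bfrom\s+([A-Za-z_][A-Za-z0-9_]*)\.", re.IGNORECASE)


def with_allow_filtering(statement, enabled):
    """Append ALLOW FILTERING to SELECTs on virtual keyspaces when enabled.

    Any other statement is returned unchanged.
    """
    stripped = statement.strip().rstrip(";").rstrip()
    lowered = stripped.lower()
    if not enabled or not lowered.startswith("select"):
        return statement
    if lowered.endswith("allow filtering"):
        return statement  # already present, nothing to add
    match = _FROM_RE.search(stripped)
    if match is None or match.group(1).lower() not in VIRTUAL_KEYSPACES:
        return statement
    return stripped + " ALLOW FILTERING;"


query = "SELECT * FROM system_views.clients WHERE username = 'cassandra';"
print(with_allow_filtering(query, enabled=True))
```

With a toggle like this the server keeps standard CQL semantics for every table, and the convenience stays a per-session choice of the operator.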

On Fri, Feb 3, 2023 at 1:24 PM C. Scott Andreas <sc...@paradoxica.net>
wrote:

> There are some ideas that development community members have kicked around
> that may falsify the assumption that "virtual tables are tiny and will fit
> in memory."
>
> One example is CASSANDRA-14629: Abstract Virtual Table for very large
> result sets
> https://issues.apache.org/jira/browse/CASSANDRA-14629
>
> Chris's proposal here is to enable query results from virtual tables to be
> streamed to the client rather than being fully materialized. There are some
> neat possibilities suggested in this ticket, such as debug functionality to
> dump the contents of a raw SSTable via the CQL interface, or the contents
> of the database's internal caches. One could also imagine a feature like
> this providing functionality similar to a foreign data wrapper in other
> databases.
>
> I don't think the assumption that "virtual tables will always be small and
> always fit in memory" is a safe one.
>
> I don't think we should implicitly add "ALLOW FILTERING" to all queries
> against virtual tables because of this, in addition to concern with
> departing from standard CQL semantics for a type of tables deemed special.
>
> – Scott
>
> On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mm...@apache.org> wrote:
>
>
> Hello Stefan,
>
> Regarding the decision to implicitly enable ALLOW FILTERING for
> virtual tables, which also makes sense to me, it may be necessary to
> consider changing the clustering columns in the virtual table metadata
> to regular columns as well. The reasons are the same as mentioned
> earlier: the virtual tables hold their data in memory, thus we do not
> benefit from the advantages of ordered data (e.g. the ClientsTable and
> its ClusteringColumn(PORT)).
>
> Changing the clustering column to a regular column may simplify the
> virtual table data model, but I'm afraid it may affect users who rely
> on the table metadata.
>
>
>
> On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org>
> wrote:
>
>
> I think removing the need for ALLOW FILTERING on virtual tables makes
> sense and would be quite useful for operators.
>
> That guard exists for performance issues that shouldn't occur on virtual
> tables. We also have a flag in case some future virtual table
> implementation has limitations regarding filtering, although it seems it's
> not the case with any of the existing virtual tables.
>
> It is not like we would promote bad habits because virtual tables are
> meant to be queried by operators / administrators only.
>
>
> It might even be quite the opposite, since in the current situation users
> might get used to routinely use ALLOW FILTERING for querying their virtual
> tables.
>
> It has been mentioned on the #cassandra-dev Slack thread where this
> started (1) that it's kind of an API inconsistency to allow querying by
> non-primary keys on virtual tables without ALLOW FILTERING, whereas it's
> required for regular tables. I think that a simply doc update saying that
> virtual tables, which are not regular tables, support filtering would be
> enough. Virtual tables are well identified by both the keyspace they belong
> to and doc, so users shouldn't have trouble knowing whether a table is
> virtual. It would be similar to the current exception for ALLOW FILTERING,
> where one needs to use it unless the table has an index for the queried
> column.
>
> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329
>
> On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <
> Stefan.Miklosovic@netapp.com> wrote:
>
>
> Hi list,
>
> the content of virtual tables is held in memory (and / or is fetched every
> time upon request). While doing queries against such table for a column
> outside of primary key, normally, users are required to specify ALLOW
> FILTERING. This makes total sense for "ordinary tables" for applications to
> have performant and effective queries but it kinds of loses the
> applicability for virtual tables when it literally holds just handful of
> entries in memory and it just does not matter, does it?
>
> What do you think about implicitly allowing filtering for virtual tables
> so we save ourselves from these pesky errors when we want to query
> arbitrary column and we need to satisfy CQL spec just to do that?
>
> It is not like we would promote bad habits because virtual tables are
> meant to be queried by operators / administrators only.
>
> We can also explicitly document this behavior.
>
> Among other options, we may try to implement secondary indices on virtual
> tables but I am not completely sure this is what we want because its
> complexity etc. Is it even necessary to put such complex logic in place
> just to be able to select any column on few entries in memory?
>
> I put together a draft here (1). It would be ever possible to implicitly
> allow filtering on virtual tables only and it would be implementator's
> responsibility to decide that, per table.
>
> For all virtual tables we currently have, I would enable this everywhere.
> I do not think there is any virtual table where we would not want to enable
> it or where people HAVE TO specify that.
>
> (1) https://github.com/apache/cassandra/pull/2131
>
>
>
>

Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by "Miklosovic, Stefan" <St...@netapp.com>.
Thanks everybody for the input. I created the ticket (1) to track the work (2).

Let's move further discussion there.

(1) https://issues.apache.org/jira/browse/CASSANDRA-18238
(2) https://github.com/apache/cassandra/pull/2142/files

________________________________________
From: Aleksey Yeshchenko <al...@apple.com>
Sent: Monday, February 6, 2023 12:11
To: dev@cassandra.apache.org
Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables


Just make virtual table implementations decide?

Add a method to VirtualTable interface to indicate if this is desirable, and call it a day?

On 6 Feb 2023, at 09:41, Benjamin Lerer <b....@gmail.com> wrote:

Making ALLOW FILTERING a table option implies giving the person creating the table the ability to change the way the server behaves for that table, which might not be something that every C* operator wants. Of course, we can allow operators to control that through the ALLOW FILTERING guardrail. At that point we would also need a default setting for the entire database.

On Fri, 3 Feb 2023 at 23:44, Miklosovic, Stefan <St...@netapp.com> wrote:
This is the draft for FILTERING ON|OFF in shell.

I would say this is the most simple solution.

We may still consider table option but what do you think about having it simply just set via shell?

https://github.com/apache/cassandra/pull/2141/files

________________________________________
From: Josh McKenzie <jm...@apache.org>
Sent: Friday, February 3, 2023 23:39
To: dev
Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables


> they would start to set ALLOW FILTERING here and there in order to not think twice about their data model so they can just call it a day.
Setting this on a per-table basis or having users set this on specific queries that hit tables and forgetting they set it are 6 of one and half-a-dozen of another.

I like the table property idea personally. That communicates an intent about the data model and expectation of the size and usage of data in the modeling of the schema that embeds some context and intent there's currently no mechanism to communicate.

On Fri, Feb 3, 2023, at 5:00 PM, Miklosovic, Stefan wrote:
Yes, there would be a discrepancy. I do not like that either. If it was only about "normal tables vs virtual tables", I could live with that. But the fact that there are going to be differences among vtables themselves, that starts to be a little bit messy. Then we would need to let operators know which tables are always allowed to be filtered on and which are not, and that just complicates it. Putting that information in a comment so it is visible in DESCRIBE is a nice idea.

That flag we talk about ... that flag would be used purely internally, it would not be in schema to be gossiped.

Also, I am starting to like the suggestion to have something like ALLOW FILTERING ON in CQLSH so it would be turned on for the whole CQL session. That leaves tables as they are and it should not be a big deal for operators to set. We would have to make sure to add an "ALLOW FILTERING" clause to every SELECT statement (to virtual tables only?) a user submits. I am not sure if this is doable yet though.

________________________________________
From: David Capwell <dc...@apple.com>
Sent: Friday, February 3, 2023 22:42
To: dev
Cc: Maxim Muzafarov
Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables


> I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

Agree, there is a repair ticket to have the coordinating node do network queries to peers to resolve the table (rather than operator querying everything, allow the coordinator node to do it for you)… so this assumption may not be true down the line.

I could be open to a table property that says ALLOW FILTERING on by default or not… then we can pick and choose vtables (or have vtables opt-out)…. I kinda like like the lack of consistency with this approach though

On Feb 3, 2023, at 11:24 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:

There are some ideas that development community members have kicked around that may falsify the assumption that "virtual tables are tiny and will fit in memory."

One example is CASSANDRA-14629: Abstract Virtual Table for very large result sets
https://issues.apache.org/jira/browse/CASSANDRA-14629

Chris's proposal here is to enable query results from virtual tables to be streamed to the client rather than being fully materialized. There are some neat possibilities suggested in this ticket, such as debug functionality to dump the contents of a raw SSTable via the CQL interface, or the contents of the database's internal caches. One could also imagine a feature like this providing functionality similar to a foreign data wrapper in other databases.

I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

I don't think we should implicitly add "ALLOW FILTERING" to all queries against virtual tables because of this, in addition to concern with departing from standard CQL semantics for a type of tables deemed special.

– Scott

On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mm...@apache.org> wrote:


Hello Stefan,

Regarding the decision to implicitly enable ALLOW FILTERING for
virtual tables, which also makes sense to me, it may be necessary to
consider changing the clustering columns in the virtual table metadata
to regular columns as well. The reasons are the same as mentioned
earlier: the virtual tables hold their data in memory, thus we do not
benefit from the advantages of ordered data (e.g. the ClientsTable and
its ClusteringColumn(PORT)).

Changing the clustering column to a regular column may simplify the
virtual table data model, but I'm afraid it may affect users who rely
on the table metadata.



On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org> wrote:

I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.

That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.

It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.


It might even be quite the opposite, since in the current situation users might get used to routinely use ALLOW FILTERING for querying their virtual tables.

It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simply doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.

(1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329

On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <St...@netapp.com> wrote:

Hi list,

the content of virtual tables is held in memory (and / or is fetched every time upon request). While doing queries against such table for a column outside of primary key, normally, users are required to specify ALLOW FILTERING. This makes total sense for "ordinary tables" for applications to have performant and effective queries but it kinds of loses the applicability for virtual tables when it literally holds just handful of entries in memory and it just does not matter, does it?

What do you think about implicitly allowing filtering for virtual tables so we save ourselves from these pesky errors when we want to query arbitrary column and we need to satisfy CQL spec just to do that?

It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.

We can also explicitly document this behavior.

Among other options, we may try to implement secondary indices on virtual tables but I am not completely sure this is what we want because its complexity etc. Is it even necessary to put such complex logic in place just to be able to select any column on few entries in memory?

I put together a draft here (1). It would be ever possible to implicitly allow filtering on virtual tables only and it would be implementator's responsibility to decide that, per table.

For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.

(1) https://github.com/apache/cassandra/pull/2131







Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Aleksey Yeshchenko <al...@apple.com>.
Just make virtual table implementations decide?

Add a method to VirtualTable interface to indicate if this is desirable, and call it a day? 
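
A minimal sketch of what such a method could look like. Everything below is an assumption made for illustration: the interface is a stand-in rather than Cassandra's actual VirtualTable, and the method name allowFilteringImplicitly, its default value, and the helper mustReject are hypothetical.

```java
// Hypothetical stand-in for the VirtualTable interface; only what this
// illustration needs is modeled, and every name here is an assumption.
interface VirtualTableSketch {
    String name();

    // Hypothetical hook: true if the table may be filtered on without the
    // query spelling out ALLOW FILTERING. The default value is an assumption.
    default boolean allowFilteringImplicitly() {
        return true;
    }
}

// A small in-memory table that keeps the permissive default.
final class SmallInMemoryTable implements VirtualTableSketch {
    @Override
    public String name() {
        return "small_in_memory_table";
    }
}

// A potentially large table that opts out, so ALLOW FILTERING stays mandatory.
final class LargeStreamingTable implements VirtualTableSketch {
    @Override
    public String name() {
        return "large_streaming_table";
    }

    @Override
    public boolean allowFilteringImplicitly() {
        return false;
    }
}

final class FilteringCheck {
    private FilteringCheck() {}

    // Returns true when a query must be rejected for lacking ALLOW FILTERING:
    // it needs filtering, does not ask for it, and the table did not opt in.
    static boolean mustReject(VirtualTableSketch table,
                              boolean needsFiltering,
                              boolean hasAllowFiltering) {
        return needsFiltering && !hasAllowFiltering && !table.allowFilteringImplicitly();
    }

    public static void main(String[] args) {
        System.out.println(mustReject(new SmallInMemoryTable(), true, false));
        System.out.println(mustReject(new LargeStreamingTable(), true, false));
    }
}
```

With this shape the decision stays inside each table implementation and nothing needs to be added to the schema.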

> On 6 Feb 2023, at 09:41, Benjamin Lerer <b....@gmail.com> wrote:
> 
> Making ALLOW FILTERING a table option implies giving the person creating the table the ability to change the way the server behaves for that table, which might not be something that every C* operator wants. Of course, we can allow operators to control that through the ALLOW FILTERING guardrail. At that point we would also need a default setting for the entire database.
> 
> On Fri, 3 Feb 2023 at 23:44, Miklosovic, Stefan <Stefan.Miklosovic@netapp.com> wrote:
>> This is the draft for FILTERING ON|OFF in shell.
>> 
>> I would say this is the most simple solution.
>> 
>> We may still consider table option but what do you think about having it simply just set via shell?
>> 
>> https://github.com/apache/cassandra/pull/2141/files
>> 
>> ________________________________________
>> From: Josh McKenzie <jmckenzie@apache.org>
>> Sent: Friday, February 3, 2023 23:39
>> To: dev
>> Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables
>> 
>> 
>> they would start to set ALLOW FILTERING here and there in order to not think twice about their data model so they can just call it a day.
>> Setting this on a per-table basis or having users set this on specific queries that hit tables and forgetting they set it are 6 of one and half-a-dozen of another.
>> 
>> I like the table property idea personally. That communicates an intent about the data model and expectation of the size and usage of data in the modeling of the schema that embeds some context and intent there's currently no mechanism to communicate.
>> 
>> On Fri, Feb 3, 2023, at 5:00 PM, Miklosovic, Stefan wrote:
>> Yes, there would be a discrepancy. I do not like that either. If it was only about "normal tables vs virtual tables", I could live with that. But the fact that there are going to be differences among vtables themselves, that starts to be a little bit messy. Then we would need to let operators know which tables are always allowed to be filtered on and which are not, and that just complicates it. Putting that information in a comment so it is visible in DESCRIBE is a nice idea.
>> 
>> That flag we talk about ... that flag would be used purely internally, it would not be in schema to be gossiped.
>> 
>> Also, I am starting to like the suggestion to have something like ALLOW FILTERING ON in CQLSH so it would be turned on for the whole CQL session. That leaves tables as they are and it should not be a big deal for operators to set. We would have to make sure to add an "ALLOW FILTERING" clause to every SELECT statement (to virtual tables only?) a user submits. I am not sure if this is doable yet though.
>> 
>> ________________________________________
>> From: David Capwell <dcapwell@apple.com>
>> Sent: Friday, February 3, 2023 22:42
>> To: dev
>> Cc: Maxim Muzafarov
>> Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables
>> 
>> 
>> I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.
>> 
>> Agree, there is a repair ticket to have the coordinating node do network queries to peers to resolve the table (rather than operator querying everything, allow the coordinator node to do it for you)… so this assumption may not be true down the line.
>> 
>> I could be open to a table property that says ALLOW FILTERING on by default or not… then we can pick and choose vtables (or have vtables opt-out)…. I kinda like like the lack of consistency with this approach though
>> 
>> On Feb 3, 2023, at 11:24 AM, C. Scott Andreas <scott@paradoxica.net> wrote:
>> 
>> There are some ideas that development community members have kicked around that may falsify the assumption that "virtual tables are tiny and will fit in memory."
>> 
>> One example is CASSANDRA-14629: Abstract Virtual Table for very large result sets
>> https://issues.apache.org/jira/browse/CASSANDRA-14629
>> 
>> Chris's proposal here is to enable query results from virtual tables to be streamed to the client rather than being fully materialized. There are some neat possibilities suggested in this ticket, such as debug functionality to dump the contents of a raw SSTable via the CQL interface, or the contents of the database's internal caches. One could also imagine a feature like this providing functionality similar to a foreign data wrapper in other databases.
>> 
>> I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.
>> 
>> I don't think we should implicitly add "ALLOW FILTERING" to all queries against virtual tables because of this, in addition to concern with departing from standard CQL semantics for a type of tables deemed special.
>> 
>> – Scott
>> 
>> On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mmuzaf@apache.org> wrote:
>> 
>> 
>> Hello Stefan,
>> 
>> Regarding the decision to implicitly enable ALLOW FILTERING for
>> virtual tables, which also makes sense to me, it may be necessary to
>> consider changing the clustering columns in the virtual table metadata
>> to regular columns as well. The reasons are the same as mentioned
>> earlier: the virtual tables hold their data in memory, thus we do not
>> benefit from the advantages of ordered data (e.g. the ClientsTable and
>> its ClusteringColumn(PORT)).
>> 
>> Changing the clustering column to a regular column may simplify the
>> virtual table data model, but I'm afraid it may affect users who rely
>> on the table metadata.
>> 
>> 
>> 
>> On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <adelapena@apache.org> wrote:
>> 
>> I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.
>> 
>> That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.
>> 
>> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
>> 
>> 
>> It might even be quite the opposite, since in the current situation users might get used to routinely use ALLOW FILTERING for querying their virtual tables.
>> 
>> It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simply doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.
>> 
>> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329
>> 
>> On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <Stefan.Miklosovic@netapp.com> wrote:
>> 
>> Hi list,
>> 
>> the content of virtual tables is held in memory (and / or is fetched every time upon request). While doing queries against such table for a column outside of primary key, normally, users are required to specify ALLOW FILTERING. This makes total sense for "ordinary tables" for applications to have performant and effective queries but it kinds of loses the applicability for virtual tables when it literally holds just handful of entries in memory and it just does not matter, does it?
>> 
>> What do you think about implicitly allowing filtering for virtual tables so we save ourselves from these pesky errors when we want to query arbitrary column and we need to satisfy CQL spec just to do that?
>> 
>> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
>> 
>> We can also explicitly document this behavior.
>> 
>> Among other options, we may try to implement secondary indices on virtual tables but I am not completely sure this is what we want because its complexity etc. Is it even necessary to put such complex logic in place just to be able to select any column on few entries in memory?
>> 
>> I put together a draft here (1). It would be ever possible to implicitly allow filtering on virtual tables only and it would be implementator's responsibility to decide that, per table.
>> 
>> For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.
>> 
>> (1) https://github.com/apache/cassandra/pull/2131
>> 
>> 
>> 
>> 
>> 


Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Benjamin Lerer <b....@gmail.com>.
Making ALLOW FILTERING a table option implies giving the person creating
the table the ability to change the way the server behaves for that table,
which might not be something that every C* operator wants. Of course, we
can allow operators to control that through the ALLOW FILTERING guardrail.
At that point we would also need a default setting for the entire database.
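
One way to read the interplay of the three settings, as a rough sketch only. The precedence shown (operator guardrail first, then the per-table option, then the database-wide default) is an assumption, and the class and parameter names are hypothetical rather than existing Cassandra code.

```java
import java.util.Optional;

final class FilteringPolicySketch {
    private FilteringPolicySketch() {}

    // Decides whether a table may be filtered on without an explicit
    // ALLOW FILTERING clause. All three inputs are hypothetical settings:
    //   guardrailAllowsFiltering - operator-level switch, always wins when off
    //   tableOption              - per-table option, empty if the creator set nothing
    //   databaseDefault          - fallback used when the table option is unset
    static boolean implicitFilteringAllowed(boolean guardrailAllowsFiltering,
                                            Optional<Boolean> tableOption,
                                            boolean databaseDefault) {
        if (!guardrailAllowsFiltering) {
            return false; // the operator has the last word
        }
        return tableOption.orElse(databaseDefault);
    }

    public static void main(String[] args) {
        // Table creator opted in, but the operator's guardrail forbids filtering.
        System.out.println(implicitFilteringAllowed(false, Optional.of(true), true));
        // No table option set, so the database-wide default applies.
        System.out.println(implicitFilteringAllowed(true, Optional.empty(), false));
        // Table option set and guardrail permits it.
        System.out.println(implicitFilteringAllowed(true, Optional.of(true), false));
    }
}
```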

On Fri, 3 Feb 2023 at 23:44, Miklosovic, Stefan <Stefan.Miklosovic@netapp.com> wrote:

> This is the draft for FILTERING ON|OFF in shell.
>
> I would say this is the most simple solution.
>
> We may still consider table option but what do you think about having it
> simply just set via shell?
>
> https://github.com/apache/cassandra/pull/2141/files
>
> ________________________________________
> From: Josh McKenzie <jm...@apache.org>
> Sent: Friday, February 3, 2023 23:39
> To: dev
> Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables
>
>
> they would start to set ALLOW FILTERING here and there in order to not
> think twice about their data model so they can just call it a day.
> Setting this on a per-table basis or having users set this on specific
> queries that hit tables and forgetting they set it are 6 of one and
> half-a-dozen of another.
>
> I like the table property idea personally. That communicates an intent
> about the data model and expectation of the size and usage of data in the
> modeling of the schema that embeds some context and intent there's
> currently no mechanism to communicate.
>
> On Fri, Feb 3, 2023, at 5:00 PM, Miklosovic, Stefan wrote:
> Yes, there would be a discrepancy. I do not like that either. If it was only
> about "normal tables vs virtual tables", I could live with that. But the
> fact that there are going to be differences among vtables themselves, that
> starts to be a little bit messy. Then we would need to let operators know
> which tables are always allowed to be filtered on and which are not, and
> that just complicates it. Putting that information in a comment so it is
> visible in DESCRIBE is a nice idea.
>
> That flag we talk about ... that flag would be used purely internally, it
> would not be in schema to be gossiped.
>
> Also, I am starting to like the suggestion to have something like ALLOW
> FILTERING ON in CQLSH so it would be turned on for the whole CQL session.
> That leaves tables as they are and it should not be a big deal for
> operators to set. We would have to make sure to add an "ALLOW FILTERING"
> clause to every SELECT statement (to virtual tables only?) a user submits.
> I am not sure if this is doable yet though.
>
> ________________________________________
> From: David Capwell <dc...@apple.com>
> Sent: Friday, February 3, 2023 22:42
> To: dev
> Cc: Maxim Muzafarov
> Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables
>
>
> I don't think the assumption that "virtual tables will always be small and
> always fit in memory" is a safe one.
>
> Agree, there is a repair ticket to have the coordinating node do network
> queries to peers to resolve the table (rather than operator querying
> everything, allow the coordinator node to do it for you)… so this
> assumption may not be true down the line.
>
> I could be open to a table property that says ALLOW FILTERING on by
> default or not… then we can pick and choose vtables (or have vtables
> opt-out)…. I kinda like like the lack of consistency with this approach
> though
>
> On Feb 3, 2023, at 11:24 AM, C. Scott Andreas <scott@paradoxica.net> wrote:
>
> There are some ideas that development community members have kicked around
> that may falsify the assumption that "virtual tables are tiny and will fit
> in memory."
>
> One example is CASSANDRA-14629: Abstract Virtual Table for very large
> result sets
> https://issues.apache.org/jira/browse/CASSANDRA-14629
>
> Chris's proposal here is to enable query results from virtual tables to be
> streamed to the client rather than being fully materialized. There are some
> neat possibilities suggested in this ticket, such as debug functionality to
> dump the contents of a raw SSTable via the CQL interface, or the contents
> of the database's internal caches. One could also imagine a feature like
> this providing functionality similar to a foreign data wrapper in other
> databases.
>
> I don't think the assumption that "virtual tables will always be small and
> always fit in memory" is a safe one.
>
> I don't think we should implicitly add "ALLOW FILTERING" to all queries
> against virtual tables because of this, in addition to concern with
> departing from standard CQL semantics for a type of tables deemed special.
>
> – Scott
>
> On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mmuzaf@apache.org> wrote:
>
>
> Hello Stefan,
>
> Regarding the decision to implicitly enable ALLOW FILTERING for
> virtual tables, which also makes sense to me, it may be necessary to
> consider changing the clustering columns in the virtual table metadata
> to regular columns as well. The reasons are the same as mentioned
> earlier: the virtual tables hold their data in memory, thus we do not
> benefit from the advantages of ordered data (e.g. the ClientsTable and
> its ClusteringColumn(PORT)).
>
> Changing the clustering column to a regular column may simplify the
> virtual table data model, but I'm afraid it may affect users who rely
> on the table metadata.
>
>
>
> On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <adelapena@apache.org> wrote:
>
> I think removing the need for ALLOW FILTERING on virtual tables makes
> sense and would be quite useful for operators.
>
> That guard exists for performance issues that shouldn't occur on virtual
> tables. We also have a flag in case some future virtual table
> implementation has limitations regarding filtering, although it seems it's
> not the case with any of the existing virtual tables.
>
> It is not like we would promote bad habits because virtual tables are
> meant to be queried by operators / administrators only.
>
>
> It might even be quite the opposite, since in the current situation users
> might get used to routinely use ALLOW FILTERING for querying their virtual
> tables.
>
> It has been mentioned on the #cassandra-dev Slack thread where this
> started (1) that it's kind of an API inconsistency to allow querying by
> non-primary keys on virtual tables without ALLOW FILTERING, whereas it's
> required for regular tables. I think that a simply doc update saying that
> virtual tables, which are not regular tables, support filtering would be
> enough. Virtual tables are well identified by both the keyspace they belong
> to and doc, so users shouldn't have trouble knowing whether a table is
> virtual. It would be similar to the current exception for ALLOW FILTERING,
> where one needs to use it unless the table has an index for the queried
> column.
>
> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329
>
> On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <Stefan.Miklosovic@netapp.com> wrote:
>
> Hi list,
>
> the content of virtual tables is held in memory (and / or is fetched every
> time upon request). While doing queries against such table for a column
> outside of primary key, normally, users are required to specify ALLOW
> FILTERING. This makes total sense for "ordinary tables" for applications to
> have performant and effective queries but it kinds of loses the
> applicability for virtual tables when it literally holds just handful of
> entries in memory and it just does not matter, does it?
>
> What do you think about implicitly allowing filtering for virtual tables
> so we save ourselves from these pesky errors when we want to query
> arbitrary column and we need to satisfy CQL spec just to do that?
>
> It is not like we would promote bad habits because virtual tables are
> meant to be queried by operators / administrators only.
>
> We can also explicitly document this behavior.
>
> Among other options, we may try to implement secondary indices on virtual
> tables but I am not completely sure this is what we want because its
> complexity etc. Is it even necessary to put such complex logic in place
> just to be able to select any column on few entries in memory?
>
> I put together a draft here (1). It would be ever possible to implicitly
> allow filtering on virtual tables only and it would be implementator's
> responsibility to decide that, per table.
>
> For all virtual tables we currently have, I would enable this everywhere.
> I do not think there is any virtual table where we would not want to enable
> it or where people HAVE TO specify that.
>
> (1) https://github.com/apache/cassandra/pull/2131
>
>
>
>
>
>

Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by "Miklosovic, Stefan" <St...@netapp.com>.
This is the draft for FILTERING ON|OFF in shell.

I would say this is the simplest solution.

We may still consider a table option, but what do you think about simply setting it via the shell?

https://github.com/apache/cassandra/pull/2141/files
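
The gist of a session-level FILTERING ON switch can be illustrated roughly as follows. This is not the code from the pull request: cqlsh itself is written in Python, and the class and method below are hypothetical. It is a deliberately naive sketch that only shows the idea of appending the clause to SELECT statements while the switch is on.

```java
import java.util.Locale;

final class ImplicitFilteringSketch {
    private static final String CLAUSE = "ALLOW FILTERING";

    private ImplicitFilteringSketch() {}

    // Appends ALLOW FILTERING to a SELECT statement when the (hypothetical)
    // session switch is on. Deliberately naive: it only recognizes statements
    // starting with "SELECT " and a clause written with a single space; it
    // does not parse CQL and cannot tell virtual tables from regular ones.
    static String withAllowFiltering(String statement, boolean filteringOn) {
        String trimmed = statement.trim();
        boolean hadSemicolon = trimmed.endsWith(";");
        String body = hadSemicolon
                ? trimmed.substring(0, trimmed.length() - 1).trim()
                : trimmed;
        String upper = body.toUpperCase(Locale.ROOT);

        // Leave the statement untouched if the switch is off, it is not a
        // SELECT, or it already ends with the clause.
        if (!filteringOn || !upper.startsWith("SELECT ") || upper.endsWith(CLAUSE)) {
            return statement;
        }
        return body + " " + CLAUSE + (hadSemicolon ? ";" : "");
    }

    public static void main(String[] args) {
        String query = "SELECT * FROM system_views.clients WHERE username = 'cassandra';";
        System.out.println(withAllowFiltering(query, true));
        System.out.println(withAllowFiltering(query, false));
    }
}
```

Because the rewrite happens on the client side, tables and schema stay as they are, which is the appeal of the shell approach.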

________________________________________
From: Josh McKenzie <jm...@apache.org>
Sent: Friday, February 3, 2023 23:39
To: dev
Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables

they would start to set ALLOW FILTERING here and there in order to not think twice about their data model so they can just call it a day.
Setting this on a per-table basis or having users set this on specific queries that hit tables and forgetting they set it are 6 of one and half-a-dozen of another.

I like the table property idea personally. That communicates an intent about the data model and expectation of the size and usage of data in the modeling of the schema that embeds some context and intent there's currently no mechanism to communicate.

On Fri, Feb 3, 2023, at 5:00 PM, Miklosovic, Stefan wrote:
Yes, there would be discrepancy. I do not like that either. If it was only about "normal tables vs virtual tables", I could live with that. But the fact that there are going to be differences among vtables themselves, that starts to be a little bit messy. Then we would need to let operators know which tables are always allowed to be filtered on and which are not and that just complicates it. Putting that information into a comment so it is visible in DESCRIBE is a nice idea.

That flag we talk about ... that flag would be used purely internally, it would not be in schema to be gossiped.

Also, I am starting to like the suggestion to have something like ALLOW FILTERING ON in CQLSH so it would be turned on whole CQL session. That leaves tables as they are and it should not be a big deal for operators to set. We would have to make sure to add "ALLOW FILTERING" clause to every SELECT statement (to virtual tables only?) a user submits. I am not sure if this is doable yet though.

________________________________________
From: David Capwell <dc...@apple.com>
Sent: Friday, February 3, 2023 22:42
To: dev
Cc: Maxim Muzafarov
Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables

I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

Agree, there is a repair ticket to have the coordinating node do network queries to peers to resolve the table (rather than operator querying everything, allow the coordinator node to do it for you)… so this assumption may not be true down the line.

I could be open to a table property that says ALLOW FILTERING on by default or not… then we can pick and choose vtables (or have vtables opt-out)…. I kinda like like the lack of consistency with this approach though

On Feb 3, 2023, at 11:24 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:

There are some ideas that development community members have kicked around that may falsify the assumption that "virtual tables are tiny and will fit in memory."

One example is CASSANDRA-14629: Abstract Virtual Table for very large result sets
https://issues.apache.org/jira/browse/CASSANDRA-14629

Chris's proposal here is to enable query results from virtual tables to be streamed to the client rather than being fully materialized. There are some neat possibilities suggested in this ticket, such as debug functionality to dump the contents of a raw SSTable via the CQL interface, or the contents of the database's internal caches. One could also imagine a feature like this providing functionality similar to a foreign data wrapper in other databases.

I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

I don't think we should implicitly add "ALLOW FILTERING" to all queries against virtual tables because of this, in addition to concern with departing from standard CQL semantics for a type of tables deemed special.

– Scott

On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mm...@apache.org> wrote:


Hello Stefan,

Regarding the decision to implicitly enable ALLOW FILTERING for
virtual tables, which also makes sense to me, it may be necessary to
consider changing the clustering columns in the virtual table metadata
to regular columns as well. The reasons are the same as mentioned
earlier: the virtual tables hold their data in memory, thus we do not
benefit from the advantages of ordered data (e.g. the ClientsTable and
its ClusteringColumn(PORT)).

Changing the clustering column to a regular column may simplify the
virtual table data model, but I'm afraid it may affect users who rely
on the table metadata.



On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org> wrote:

I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.

That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.

It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.


It might even be quite the opposite, since in the current situation users might get used to routinely use ALLOW FILTERING for querying their virtual tables.

It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simply doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.

(1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329

On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <St...@netapp.com> wrote:

Hi list,

the content of virtual tables is held in memory (and / or is fetched every time upon request). While doing queries against such table for a column outside of primary key, normally, users are required to specify ALLOW FILTERING. This makes total sense for "ordinary tables" for applications to have performant and effective queries but it kinds of loses the applicability for virtual tables when it literally holds just handful of entries in memory and it just does not matter, does it?

What do you think about implicitly allowing filtering for virtual tables so we save ourselves from these pesky errors when we want to query arbitrary column and we need to satisfy CQL spec just to do that?

It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.

We can also explicitly document this behavior.

Among other options, we may try to implement secondary indices on virtual tables but I am not completely sure this is what we want because its complexity etc. Is it even necessary to put such complex logic in place just to be able to select any column on few entries in memory?

I put together a draft here (1). It would be ever possible to implicitly allow filtering on virtual tables only and it would be implementator's responsibility to decide that, per table.

For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.

(1) https://github.com/apache/cassandra/pull/2131






Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Josh McKenzie <jm...@apache.org>.
> they would start to set ALLOW FILTERING here and there in order to not think twice about their data model so they can just call it a day.
Setting this on a per-table basis or having users set this on specific queries that hit tables and forgetting they set it are 6 of one and half-a-dozen of another.

I like the table property idea personally. That communicates an intent about the data model and expectation of the size and usage of data in the modeling of the schema that embeds some context and intent there's currently no mechanism to communicate.

On Fri, Feb 3, 2023, at 5:00 PM, Miklosovic, Stefan wrote:
> Yes, there would be discrepancy. I do not like that either. If it was only about "normal tables vs virtual tables", I could live with that. But the fact that there are going to be differences among vtables themselves, that starts to be a little bit messy. Then we would need to let operators know which tables are always allowed to be filtered on and which are not and that just complicates it. Putting that information into a comment so it is visible in DESCRIBE is a nice idea.
> 
> That flag we talk about ... that flag would be used purely internally, it would not be in schema to be gossiped.
> 
> Also, I am starting to like the suggestion to have something like ALLOW FILTERING ON in CQLSH so it would be turned on whole CQL session. That leaves tables as they are and it should not be a big deal for operators to set. We would have to make sure to add "ALLOW FILTERING" clause to every SELECT statement (to virtual tables only?) a user submits. I am not sure if this is doable yet though.
> 
> ________________________________________
> From: David Capwell <dc...@apple.com>
> Sent: Friday, February 3, 2023 22:42
> To: dev
> Cc: Maxim Muzafarov
> Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables
> 
> I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.
> 
> Agree, there is a repair ticket to have the coordinating node do network queries to peers to resolve the table (rather than operator querying everything, allow the coordinator node to do it for you)… so this assumption may not be true down the line.
> 
> I could be open to a table property that says ALLOW FILTERING on by default or not… then we can pick and choose vtables (or have vtables opt-out)…. I kinda like like the lack of consistency with this approach though
> 
> On Feb 3, 2023, at 11:24 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:
> 
> There are some ideas that development community members have kicked around that may falsify the assumption that "virtual tables are tiny and will fit in memory."
> 
> One example is CASSANDRA-14629: Abstract Virtual Table for very large result sets
> https://issues.apache.org/jira/browse/CASSANDRA-14629
> 
> Chris's proposal here is to enable query results from virtual tables to be streamed to the client rather than being fully materialized. There are some neat possibilities suggested in this ticket, such as debug functionality to dump the contents of a raw SSTable via the CQL interface, or the contents of the database's internal caches. One could also imagine a feature like this providing functionality similar to a foreign data wrapper in other databases.
> 
> I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.
> 
> I don't think we should implicitly add "ALLOW FILTERING" to all queries against virtual tables because of this, in addition to concern with departing from standard CQL semantics for a type of tables deemed special.
> 
> – Scott
> 
> On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mm...@apache.org> wrote:
> 
> 
> Hello Stefan,
> 
> Regarding the decision to implicitly enable ALLOW FILTERING for
> virtual tables, which also makes sense to me, it may be necessary to
> consider changing the clustering columns in the virtual table metadata
> to regular columns as well. The reasons are the same as mentioned
> earlier: the virtual tables hold their data in memory, thus we do not
> benefit from the advantages of ordered data (e.g. the ClientsTable and
> its ClusteringColumn(PORT)).
> 
> Changing the clustering column to a regular column may simplify the
> virtual table data model, but I'm afraid it may affect users who rely
> on the table metadata.
> 
> 
> 
> On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org> wrote:
> 
> I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.
> 
> That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.
> 
> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
> 
> 
> It might even be quite the opposite, since in the current situation users might get used to routinely use ALLOW FILTERING for querying their virtual tables.
> 
> It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simply doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.
> 
> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329
> 
> On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <St...@netapp.com> wrote:
> 
> Hi list,
> 
> the content of virtual tables is held in memory (and / or is fetched every time upon request). While doing queries against such table for a column outside of primary key, normally, users are required to specify ALLOW FILTERING. This makes total sense for "ordinary tables" for applications to have performant and effective queries but it kinds of loses the applicability for virtual tables when it literally holds just handful of entries in memory and it just does not matter, does it?
> 
> What do you think about implicitly allowing filtering for virtual tables so we save ourselves from these pesky errors when we want to query arbitrary column and we need to satisfy CQL spec just to do that?
> 
> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
> 
> We can also explicitly document this behavior.
> 
> Among other options, we may try to implement secondary indices on virtual tables but I am not completely sure this is what we want because its complexity etc. Is it even necessary to put such complex logic in place just to be able to select any column on few entries in memory?
> 
> I put together a draft here (1). It would be ever possible to implicitly allow filtering on virtual tables only and it would be implementator's responsibility to decide that, per table.
> 
> For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.
> 
> (1) https://github.com/apache/cassandra/pull/2131
> 
> 
> 
> 

Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by "Miklosovic, Stefan" <St...@netapp.com>.
Yes, there would be discrepancy. I do not like that either. If it was only about "normal tables vs virtual tables", I could live with that. But the fact that there are going to be differences among vtables themselves, that starts to be a little bit messy. Then we would need to let operators know which tables are always allowed to be filtered on and which are not and that just complicates it. Putting that information into a comment so it is visible in DESCRIBE is a nice idea.

That flag we talk about ... that flag would be used purely internally, it would not be in the schema to be gossiped.

Also, I am starting to like the suggestion to have something like ALLOW FILTERING ON in CQLSH so it would be turned on for the whole CQL session. That leaves tables as they are and it should not be a big deal for operators to set. We would have to make sure to add the "ALLOW FILTERING" clause to every SELECT statement (to virtual tables only?) a user submits. I am not sure if this is doable yet though.
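
On the "to virtual tables only?" question, one rough way a shell could decide is by looking at the keyspace in the FROM clause. The sketch below is only an illustration under stated assumptions: the function name targets_virtual_table is hypothetical, the keyspace set is assumed to be the two virtual keyspaces shipped with Cassandra 4.x, and the regular expression is a best-effort match rather than a CQL parser.

```python
import re

# Keyspaces assumed to hold virtual tables (fixed set for this sketch).
VIRTUAL_KEYSPACES = {"system_views", "system_virtual_schema"}

# Matches FROM [keyspace.]table, with optional double quotes around names.
_FROM = re.compile(r'\bFROM\s+(?:"?(\w+)"?\s*\.\s*)?"?(\w+)"?', re.IGNORECASE)


def targets_virtual_table(statement, current_keyspace=None):
    """Best-effort check whether a SELECT reads from a virtual keyspace."""
    match = _FROM.search(statement)
    if match is None:
        return False
    # Fall back to the session keyspace (set by USE) for unqualified names.
    keyspace = match.group(1) or current_keyspace
    return keyspace is not None and keyspace.lower() in VIRTUAL_KEYSPACES
```

A shell would call this before appending the clause, so statements against ordinary tables keep today's behavior.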

________________________________________
From: David Capwell <dc...@apple.com>
Sent: Friday, February 3, 2023 22:42
To: dev
Cc: Maxim Muzafarov
Subject: Re: Implicitly enabling ALLOW FILTERING on virtual tables

I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

Agree, there is a repair ticket to have the coordinating node do network queries to peers to resolve the table (rather than operator querying everything, allow the coordinator node to do it for you)… so this assumption may not be true down the line.

I could be open to a table property that says ALLOW FILTERING on by default or not… then we can pick and choose vtables (or have vtables opt-out)…. I kinda like like the lack of consistency with this approach though

On Feb 3, 2023, at 11:24 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:

There are some ideas that development community members have kicked around that may falsify the assumption that "virtual tables are tiny and will fit in memory."

One example is CASSANDRA-14629: Abstract Virtual Table for very large result sets
https://issues.apache.org/jira/browse/CASSANDRA-14629

Chris's proposal here is to enable query results from virtual tables to be streamed to the client rather than being fully materialized. There are some neat possibilities suggested in this ticket, such as debug functionality to dump the contents of a raw SSTable via the CQL interface, or the contents of the database's internal caches. One could also imagine a feature like this providing functionality similar to a foreign data wrapper in other databases.

I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

I don't think we should implicitly add "ALLOW FILTERING" to all queries against virtual tables because of this, in addition to concern with departing from standard CQL semantics for a type of tables deemed special.

– Scott

On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mm...@apache.org> wrote:


Hello Stefan,

Regarding the decision to implicitly enable ALLOW FILTERING for
virtual tables, which also makes sense to me, it may be necessary to
consider changing the clustering columns in the virtual table metadata
to regular columns as well. The reasons are the same as mentioned
earlier: the virtual tables hold their data in memory, thus we do not
benefit from the advantages of ordered data (e.g. the ClientsTable and
its ClusteringColumn(PORT)).

Changing the clustering column to a regular column may simplify the
virtual table data model, but I'm afraid it may affect users who rely
on the table metadata.



On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org> wrote:

I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.

That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.

It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.


It might even be quite the opposite, since in the current situation users might get used to routinely use ALLOW FILTERING for querying their virtual tables.

It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simply doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.

(1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329

On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <St...@netapp.com> wrote:

Hi list,

the content of virtual tables is held in memory (and / or is fetched every time upon request). While doing queries against such table for a column outside of primary key, normally, users are required to specify ALLOW FILTERING. This makes total sense for "ordinary tables" for applications to have performant and effective queries but it kinds of loses the applicability for virtual tables when it literally holds just handful of entries in memory and it just does not matter, does it?

What do you think about implicitly allowing filtering for virtual tables so we save ourselves from these pesky errors when we want to query arbitrary column and we need to satisfy CQL spec just to do that?

It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.

We can also explicitly document this behavior.

Among other options, we may try to implement secondary indices on virtual tables but I am not completely sure this is what we want because its complexity etc. Is it even necessary to put such complex logic in place just to be able to select any column on few entries in memory?

I put together a draft here (1). It would be ever possible to implicitly allow filtering on virtual tables only and it would be implementator's responsibility to decide that, per table.

For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.

(1) https://github.com/apache/cassandra/pull/2131




Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by David Capwell <dc...@apple.com>.
> I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

Agree, there is a repair ticket to have the coordinating node do network queries to peers to resolve the table (rather than operator querying everything, allow the coordinator node to do it for you)… so this assumption may not be true down the line.

I could be open to a table property that says ALLOW FILTERING on by default or not… then we can pick and choose vtables (or have vtables opt-out)…. I kinda like like the lack of consistency with this approach though
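
As a rough illustration of the table-property idea, the sketch below models the decision with a per-table flag that vtables could default to on and individually opt out of. The names TableInfo, filtering_by_default, and requires_allow_filtering are hypothetical, and the rule is simplified to "a restriction on a non-key column without an index"; the real CQL rules cover more cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    name: str
    is_virtual: bool
    # Hypothetical table property: filtering allowed without the clause.
    filtering_by_default: bool = False


def requires_allow_filtering(table, restricts_unindexed_non_key_column):
    """Return True if the query must spell out ALLOW FILTERING."""
    if not restricts_unindexed_non_key_column:
        return False
    return not table.filtering_by_default


# A vtable that opts in, a vtable that opts out, and an ordinary table.
clients = TableInfo("system_views.clients", is_virtual=True, filtering_by_default=True)
big_vtable = TableInfo("system_views.some_large_table", is_virtual=True)
users = TableInfo("app.users", is_virtual=False)
```

Here only the opted-in table can be filtered without the clause, which is the per-table discrepancy discussed in this thread.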

> On Feb 3, 2023, at 11:24 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:
> 
> There are some ideas that development community members have kicked around that may falsify the assumption that "virtual tables are tiny and will fit in memory."
> 
> One example is CASSANDRA-14629: Abstract Virtual Table for very large result sets
> https://issues.apache.org/jira/browse/CASSANDRA-14629
> 
> Chris's proposal here is to enable query results from virtual tables to be streamed to the client rather than being fully materialized. There are some neat possibilities suggested in this ticket, such as debug functionality to dump the contents of a raw SSTable via the CQL interface, or the contents of the database's internal caches. One could also imagine a feature like this providing functionality similar to a foreign data wrapper in other databases.
> 
> I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.
> 
> I don't think we should implicitly add "ALLOW FILTERING" to all queries against virtual tables because of this, in addition to concern with departing from standard CQL semantics for a type of tables deemed special.
> 
> – Scott
> 
>> On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mm...@apache.org> wrote:
>> 
>> 
>> Hello Stefan,
>> 
>> Regarding the decision to implicitly enable ALLOW FILTERING for
>> virtual tables, which also makes sense to me, it may be necessary to
>> consider changing the clustering columns in the virtual table metadata
>> to regular columns as well. The reasons are the same as mentioned
>> earlier: the virtual tables hold their data in memory, thus we do not
>> benefit from the advantages of ordered data (e.g. the ClientsTable and
>> its ClusteringColumn(PORT)).
>> 
>> Changing the clustering column to a regular column may simplify the
>> virtual table data model, but I'm afraid it may affect users who rely
>> on the table metadata.
>> 
>> 
>> 
>> On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org> wrote:
>>> 
>>> I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.
>>> 
>>> That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.
>>> 
>>> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
>>> 
>>> 
>>> It might even be quite the opposite, since in the current situation users might get used to routinely use ALLOW FILTERING for querying their virtual tables.
>>> 
>>> It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simply doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.
>>> 
>>> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329
>>> 
>>> On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <St...@netapp.com> wrote:
>>>> 
>>>> Hi list,
>>>> 
>>>> the content of virtual tables is held in memory (and / or is fetched every time upon request). While doing queries against such table for a column outside of primary key, normally, users are required to specify ALLOW FILTERING. This makes total sense for "ordinary tables" for applications to have performant and effective queries but it kinds of loses the applicability for virtual tables when it literally holds just handful of entries in memory and it just does not matter, does it?
>>>> 
>>>> What do you think about implicitly allowing filtering for virtual tables so we save ourselves from these pesky errors when we want to query arbitrary column and we need to satisfy CQL spec just to do that?
>>>> 
>>>> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
>>>> 
>>>> We can also explicitly document this behavior.
>>>> 
>>>> Among other options, we may try to implement secondary indices on virtual tables but I am not completely sure this is what we want because its complexity etc. Is it even necessary to put such complex logic in place just to be able to select any column on few entries in memory?
>>>> 
>>>> I put together a draft here (1). It would be ever possible to implicitly allow filtering on virtual tables only and it would be implementator's responsibility to decide that, per table.
>>>> 
>>>> For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.
>>>> 
>>>> (1) https://github.com/apache/cassandra/pull/2131
> 
> 


Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by "C. Scott Andreas" <sc...@paradoxica.net>.
There are some ideas that development community members have kicked around that may falsify the assumption that "virtual tables are tiny and will fit in memory."

One example is CASSANDRA-14629: Abstract Virtual Table for very large result sets
https://issues.apache.org/jira/browse/CASSANDRA-14629

Chris's proposal here is to enable query results from virtual tables to be streamed to the client rather than being fully materialized. There are some neat possibilities suggested in this ticket, such as debug functionality to dump the contents of a raw SSTable via the CQL interface, or the contents of the database's internal caches. One could also imagine a feature like this providing functionality similar to a foreign data wrapper in other databases.

I don't think the assumption that "virtual tables will always be small and always fit in memory" is a safe one.

I don't think we should implicitly add "ALLOW FILTERING" to all queries against virtual tables because of this, in addition to concern with departing from standard CQL semantics for a type of tables deemed special.

– Scott

On Feb 3, 2023, at 6:52 AM, Maxim Muzafarov <mm...@apache.org> wrote:

> Hello Stefan,
>
> Regarding the decision to implicitly enable ALLOW FILTERING for
> virtual tables, which also makes sense to me, it may be necessary to
> consider changing the clustering columns in the virtual table metadata
> to regular columns as well. The reasons are the same as mentioned
> earlier: the virtual tables hold their data in memory, thus we do not
> benefit from the advantages of ordered data (e.g. the ClientsTable and
> its ClusteringColumn(PORT)).
>
> Changing the clustering column to a regular column may simplify the
> virtual table data model, but I'm afraid it may affect users who rely
> on the table metadata.
>
> On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org> wrote:
>>
>> I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.
>>
>> That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.
>>
>> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
>>
>> It might even be quite the opposite, since in the current situation users might get used to routinely use ALLOW FILTERING for querying their virtual tables.
>>
>> It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simply doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.
>>
>> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329
>>
>> On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <St...@netapp.com> wrote:
>>>
>>> Hi list,
>>>
>>> the content of virtual tables is held in memory (and / or is fetched every time upon request). While doing queries against such table for a column outside of primary key, normally, users are required to specify ALLOW FILTERING. This makes total sense for "ordinary tables" for applications to have performant and effective queries but it kinds of loses the applicability for virtual tables when it literally holds just handful of entries in memory and it just does not matter, does it?
>>>
>>> What do you think about implicitly allowing filtering for virtual tables so we save ourselves from these pesky errors when we want to query arbitrary column and we need to satisfy CQL spec just to do that?
>>>
>>> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
>>>
>>> We can also explicitly document this behavior.
>>>
>>> Among other options, we may try to implement secondary indices on virtual tables but I am not completely sure this is what we want because its complexity etc. Is it even necessary to put such complex logic in place just to be able to select any column on few entries in memory?
>>>
>>> I put together a draft here (1). It would be ever possible to implicitly allow filtering on virtual tables only and it would be implementator's responsibility to decide that, per table.
>>>
>>> For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.
>>>
>>> (1) https://github.com/apache/cassandra/pull/2131
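
For readers who want the materialized-versus-streamed distinction from the message above made concrete, here is a minimal Python sketch. It is not Cassandra code and not the design proposed in CASSANDRA-14629; `rows_source`, `filter_materialized`, and `filter_streamed` are hypothetical names, and the row shape is invented. It only illustrates that streaming bounds memory while a filter on an arbitrary column still has to read every row up to the last match it returns.

```python
from itertools import islice
from typing import Callable, Dict, Iterator, List

Row = Dict[str, int]


def rows_source(n: int, reads: List[int]) -> Iterator[Row]:
    """Hypothetical virtual table producing n rows lazily.

    reads[0] is incremented for every row produced, so callers can
    see how many rows were actually read.
    """
    for i in range(n):
        reads[0] += 1
        yield {"id": i, "size": i % 100}


def filter_materialized(rows: Iterator[Row], pred: Callable[[Row], bool]) -> List[Row]:
    # Pulls the whole table into memory first, then filters it.
    all_rows = list(rows)
    return [r for r in all_rows if pred(r)]


def filter_streamed(rows: Iterator[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    # Yields matches one at a time; memory stays bounded, but every
    # row before each match is still scanned.
    return (r for r in rows if pred(r))


if __name__ == "__main__":
    reads = [0]
    matches = filter_streamed(rows_source(1_000_000, reads), lambda r: r["size"] == 99)
    first_three = list(islice(matches, 3))
    print([r["id"] for r in first_three], "rows read:", reads[0])
```

With a small, fully in-memory table either function is cheap. With a very large or externally backed source, the scan cost of an unrestricted filter is what the ALLOW FILTERING guard is meant to make explicit.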

Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Maxim Muzafarov <mm...@apache.org>.
Hello Stefan,

Regarding the decision to implicitly enable ALLOW FILTERING for
virtual tables, which also makes sense to me, it may be necessary to
consider changing the clustering columns in the virtual table metadata
to regular columns as well. The reasons are the same as mentioned
earlier: the virtual tables hold their data in memory, thus we do not
benefit from the advantages of ordered data (e.g. the ClientsTable and
its ClusteringColumn(PORT)).

Changing the clustering column to a regular column may simplify the
virtual table data model, but I'm afraid it may affect users who rely
on the table metadata.
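
A rough Python sketch of the point about in-memory data, under stated assumptions: the rows and the column names `address`, `port`, and `username` are hypothetical and only loosely modeled on a clients-style table, and `scan` and `partition_rows` are made-up helpers, not Cassandra APIs. When all rows sit in a small in-memory collection, a lookup by a clustering column and a lookup by a regular column are the same linear scan. What the clustering column still determines in this sketch is the order of rows within a partition, which is one way a metadata change could become visible to users.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical rows; "address" plays the partition key, "port" the clustering column.
ROWS: List[Row] = [
    {"address": "10.0.0.1", "port": 50012, "username": "alice"},
    {"address": "10.0.0.1", "port": 50003, "username": "bob"},
    {"address": "10.0.0.2", "port": 50007, "username": "alice"},
]


def scan(rows: List[Row], column: str, value: Any) -> List[Row]:
    """Linear scan: the cost is identical whichever column is restricted."""
    return [r for r in rows if r[column] == value]


def partition_rows(rows: List[Row], address: str, clustering: str = "port") -> List[Row]:
    """Rows of one partition, ordered by the (assumed) clustering column."""
    in_partition = [r for r in rows if r["address"] == address]
    return sorted(in_partition, key=lambda r: r[clustering])


if __name__ == "__main__":
    print(scan(ROWS, "port", 50003))       # restriction on the clustering column
    print(scan(ROWS, "username", "alice"))  # restriction on a regular column
    print([r["port"] for r in partition_rows(ROWS, "10.0.0.1")])
```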



On Fri, 3 Feb 2023 at 12:32, Andrés de la Peña <ad...@apache.org> wrote:
>
> I think removing the need for ALLOW FILTERING on virtual tables makes sense and would be quite useful for operators.
>
> That guard exists for performance issues that shouldn't occur on virtual tables. We also have a flag in case some future virtual table implementation has limitations regarding filtering, although it seems it's not the case with any of the existing virtual tables.
>
> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
>
>
> It might even be quite the opposite, since in the current situation users might get used to routinely use ALLOW FILTERING for querying their virtual tables.
>
> It has been mentioned on the #cassandra-dev Slack thread where this started (1) that it's kind of an API inconsistency to allow querying by non-primary keys on virtual tables without ALLOW FILTERING, whereas it's required for regular tables. I think that a simply doc update saying that virtual tables, which are not regular tables, support filtering would be enough. Virtual tables are well identified by both the keyspace they belong to and doc, so users shouldn't have trouble knowing whether a table is virtual. It would be similar to the current exception for ALLOW FILTERING, where one needs to use it unless the table has an index for the queried column.
>
> (1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329
>
> On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <St...@netapp.com> wrote:
>>
>> Hi list,
>>
>> the content of virtual tables is held in memory (and / or is fetched every time upon request). While doing queries against such table for a column outside of primary key, normally, users are required to specify ALLOW FILTERING. This makes total sense for "ordinary tables" for applications to have performant and effective queries but it kinds of loses the applicability for virtual tables when it literally holds just handful of entries in memory and it just does not matter, does it?
>>
>> What do you think about implicitly allowing filtering for virtual tables so we save ourselves from these pesky errors when we want to query arbitrary column and we need to satisfy CQL spec just to do that?
>>
>> It is not like we would promote bad habits because virtual tables are meant to be queried by operators / administrators only.
>>
>> We can also explicitly document this behavior.
>>
>> Among other options, we may try to implement secondary indices on virtual tables but I am not completely sure this is what we want because its complexity etc. Is it even necessary to put such complex logic in place just to be able to select any column on few entries in memory?
>>
>> I put together a draft here (1). It would be ever possible to implicitly allow filtering on virtual tables only and it would be implementator's responsibility to decide that, per table.
>>
>> For all virtual tables we currently have, I would enable this everywhere. I do not think there is any virtual table where we would not want to enable it or where people HAVE TO specify that.
>>
>> (1) https://github.com/apache/cassandra/pull/2131

Re: Implicitly enabling ALLOW FILTERING on virtual tables

Posted by Andrés de la Peña <ad...@apache.org>.
I think removing the need for ALLOW FILTERING on virtual tables makes sense
and would be quite useful for operators.

That guard exists for performance issues that shouldn't occur on virtual
tables. We also have a flag in case some future virtual table
implementation has limitations regarding filtering, although it seems it's
not the case with any of the existing virtual tables.

> It is not like we would promote bad habits because virtual tables are meant
> to be queried by operators / administrators only.


It might even be quite the opposite, since in the current situation users
might get used to routinely using ALLOW FILTERING for querying their virtual
tables.

It has been mentioned on the #cassandra-dev Slack thread where this started
(1) that it's kind of an API inconsistency to allow querying by non-primary
keys on virtual tables without ALLOW FILTERING, whereas it's required for
regular tables. I think that a simple doc update saying that virtual
tables, which are not regular tables, support filtering would be enough.
Virtual tables are well identified by both the keyspace they belong to and
doc, so users shouldn't have trouble knowing whether a table is virtual. It
would be similar to the current exception for ALLOW FILTERING, where one
needs to use it unless the table has an index for the queried column.
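
The rule described in this paragraph, together with the per-table flag mentioned earlier, can be sketched in Python as follows. This is a deliberately simplified model and not Cassandra's actual validation logic, which has more cases (for example, partially restricted partition keys). `Table`, `implicit_filtering`, and `needs_allow_filtering` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Table:
    name: str
    primary_key: FrozenSet[str]
    indexed: FrozenSet[str] = frozenset()
    is_virtual: bool = False
    # Hypothetical per-table opt-in, decided by the table's implementor.
    implicit_filtering: bool = False


def needs_allow_filtering(table: Table, restricted: Iterable[str]) -> bool:
    """Simplified rule: True if the query must spell out ALLOW FILTERING."""
    if table.is_virtual and table.implicit_filtering:
        # Proposed exception: the virtual table filters implicitly.
        return False
    # Existing exception: an index on the restricted column lifts the requirement.
    return any(
        col not in table.primary_key and col not in table.indexed
        for col in restricted
    )


if __name__ == "__main__":
    users = Table("users", frozenset({"id"}))
    clients = Table("clients", frozenset({"address", "port"}),
                    is_virtual=True, implicit_filtering=True)
    print(needs_allow_filtering(users, ["email"]))       # regular table, unindexed column
    print(needs_allow_filtering(clients, ["username"]))  # virtual table that opted in
```

Keeping the decision per table, rather than global, leaves room for a future virtual table that is too large to filter cheaply to keep the explicit requirement.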

(1) https://the-asf.slack.com/archives/CK23JSY2K/p1675352759267329

On Fri, 3 Feb 2023 at 09:09, Miklosovic, Stefan <
Stefan.Miklosovic@netapp.com> wrote:

> Hi list,
>
> the content of virtual tables is held in memory (and / or is fetched every
> time upon request). While doing queries against such table for a column
> outside of primary key, normally, users are required to specify ALLOW
> FILTERING. This makes total sense for "ordinary tables" for applications to
> have performant and effective queries but it kinds of loses the
> applicability for virtual tables when it literally holds just handful of
> entries in memory and it just does not matter, does it?
>
> What do you think about implicitly allowing filtering for virtual tables
> so we save ourselves from these pesky errors when we want to query
> arbitrary column and we need to satisfy CQL spec just to do that?
>
> It is not like we would promote bad habits because virtual tables are
> meant to be queried by operators / administrators only.
>
> We can also explicitly document this behavior.
>
> Among other options, we may try to implement secondary indices on virtual
> tables but I am not completely sure this is what we want because its
> complexity etc. Is it even necessary to put such complex logic in place
> just to be able to select any column on few entries in memory?
>
> I put together a draft here (1). It would be ever possible to implicitly
> allow filtering on virtual tables only and it would be implementator's
> responsibility to decide that, per table.
>
> For all virtual tables we currently have, I would enable this everywhere.
> I do not think there is any virtual table where we would not want to enable
> it or where people HAVE TO specify that.
>
> (1) https://github.com/apache/cassandra/pull/2131