You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Avi Levi <av...@indeni.com> on 2017/10/09 12:13:38 UTC

Using materialized view or AllowFiltering which one is better ?

Hi

I have the following table:

CREATE TABLE users (
    username text,
    last_seen bigint,
    PRIMARY KEY (username)
);

where* last_seen* is basically the writetime . Number of records in the
table is aprox 10 million. Insert is pretty much straightforward insert
into users (username, last_seen) VALUES ([username], now)

I want to make some processing on users that were not seen for the past XXX
(where xxx can be hours/days ... ) by query the last_seen column (this
query runs every minute) e.g :

select username from users where last_seen < (now - 1 day).

I have two options as I see it:

   1. use materialized view :

CREATE MATERIALIZED VIEW users_last_seen AS
SELECT last_seen, username
FROM users
WHERE last_seen IS NOT NULL
PRIMARY KEY (last_seen, username);


and simply query:

select username from users_last_seen where last_seen < (now - 1 day)

   1.

   query the users table

   select username from users where last_seen < (now - 1 day) ALLOW
   FILTERING

which one is more efficient? any other options ?

Any help will be greatly appreciated

Best

Avi

Re: Using materialized view or AllowFiltering which one is better ?

Posted by Avi Levi <av...@indeni.com>.
Thank you Crisan.
Using SASI does seems better solution. Although it is officially
<https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useSASIIndex.html>
not supported in production, I think that this is the optimal solution in
this case

On Mon, Oct 9, 2017 at 11:01 PM, Valentina Crisan <
valentina.crisan@gmail.com> wrote:

> Not really, my suggested primary key is similar to the one you have in
> your proposed MV. The only difference is that in MV it is Cassandra that
> takes care of data synchronization,  with manual denormalization you would
> need to do it yourself. Example with MV: If you had username 'andreas1988'
> and last_seen "2017-09-11 23:58:23' in your base table and then this user
> access the service and last_seen is updated to "2017-10-09 23:58:23" in
> your base table - what will happen in the background is that MV Cassandra
> will delete in a batch operation from the partition "2017-09-11 23:58:23"
> the username "andreas1988" and add it to partition "2017-10-09 23:58:23".
> Only when this batch will finish Cassandra will update the base table.
> If you denormalize manually it will be you that will need to create
> batches operations and do this changes manually, making sure that you only
> save last value for last_seen in your table. You will obtain the same in
> the end only the operation effort will be bigger.
> I understand why MV would be good for your requirements, but I have seen
> from the discussions that MV is not recommended for production mainly due
> to the fact that is not possible to check if a view is out of sync with the
> base table. Check older discussions (one or two weeks ago) and see details
> there re MV usage in production.
>
> One other solution could be to work on your users table with a secondary
> index on last_seen field ( Cassandra 3.4 onwards,  SASI would allow
> operators like <, > and multiple columns indexing) - clearly better than
> allow filtering - but still the whole cluster would be contacted most of
> the times for your queries. Maybe combining Cassandra SASI with Spark data
> locality could solve this better. But first you could try with SASI and see
> the query performance.
>
> Valentina
>
>
> On Mon, Oct 9, 2017 at 7:56 PM, Avi Levi <av...@indeni.com> wrote:
>
>> Thanks Crisan .
>> I understand what you're saying. But according to your suggestion I will
>> have a record for every entry while I am interested only on the last entry
>> . So the proposed solution is actually keeping much more data then needed .
>>
>> On Oct 9, 2017 8:40 PM, "Valentina Crisan" <va...@gmail.com>
>> wrote:
>>
>> Allow filtering is almost never the answer, especially when you want to
>> do a full table scan ( there might be some cases where the query is limited
>> to a partition and allow filtering could be used). And you would like to
>> run this query every minute - thus extremely good performance is required.
>> Allow filtering basically brings locally in your coordinator the whole
>> table content and performs local filtering of the data before answering
>> your query. Performance wise is not recommended to use such an
>> implementation.
>>
>> For a query running every minute you need to address it in one partition
>> read (according to Cassandra data modeling rules) and that can be done with
>> denormalization ( manually or materialized views). As far as I know and
>> also from the discussions in this list MV should be used still with caution
>> in production environments. Thus, the best option in my opinion is manual
>> denormalization of data, building a table with partition key last_seen and
>> clustering key username and adding/updating data accordingly. Furthermore
>> last_seen I understand it's a value of any time/hour of day - you could
>> consider building partitions per day: partition key  = (last_seen, day),
>> primary key = ((last_seen,day),username)).
>>
>> Valentina
>>
>> On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi <av...@indeni.com> wrote:
>>
>>> Hi
>>>
>>> I have the following table:
>>>
>>> CREATE TABLE users (
>>>     username text,
>>>     last_seen bigint,
>>>     PRIMARY KEY (username)
>>> );
>>>
>>> where* last_seen* is basically the writetime . Number of records in the
>>> table is aprox 10 million. Insert is pretty much straightforward insert
>>> into users (username, last_seen) VALUES ([username], now)
>>>
>>> I want to make some processing on users that were not seen for the past
>>> XXX (where xxx can be hours/days ... ) by query the last_seen column
>>> (this query runs every minute) e.g :
>>>
>>> select username from users where last_seen < (now - 1 day).
>>>
>>> I have two options as I see it:
>>>
>>>    1. use materialized view :
>>>
>>> CREATE MATERIALIZED VIEW users_last_seen AS
>>> SELECT last_seen, username
>>> FROM users
>>> WHERE last_seen IS NOT NULL
>>> PRIMARY KEY (last_seen, username);
>>>
>>>
>>> and simply query:
>>>
>>> select username from users_last_seen where last_seen < (now - 1 day)
>>>
>>>    1.
>>>
>>>    query the users table
>>>
>>>    select username from users where last_seen < (now - 1 day) ALLOW
>>>    FILTERING
>>>
>>> which one is more efficient? any other options ?
>>>
>>> Any help will be greatly appreciated
>>>
>>> Best
>>>
>>> Avi
>>>
>>
>>
>>
>

Re: Using materialized view or AllowFiltering which one is better ?

Posted by Valentina Crisan <va...@gmail.com>.
Not really, my suggested primary key is similar to the one you have in your
proposed MV. The only difference is that in MV it is Cassandra that takes
care of data synchronization,  with manual denormalization you would need
to do it yourself. Example with MV: If you had username 'andreas1988' and
last_seen "2017-09-11 23:58:23' in your base table and then this user
access the service and last_seen is updated to "2017-10-09 23:58:23" in
your base table - what will happen in the background is that MV Cassandra
will delete in a batch operation from the partition "2017-09-11 23:58:23"
the username "andreas1988" and add it to partition "2017-10-09 23:58:23".
Only when this batch will finish Cassandra will update the base table.
If you denormalize manually it will be you that will need to create batches
operations and do this changes manually, making sure that you only save
last value for last_seen in your table. You will obtain the same in the end
only the operation effort will be bigger.
I understand why MV would be good for your requirements, but I have seen
from the discussions that MV is not recommended for production mainly due
to the fact that is not possible to check if a view is out of sync with the
base table. Check older discussions (one or two weeks ago) and see details
there re MV usage in production.

One other solution could be to work on your users table with a secondary
index on last_seen field ( Cassandra 3.4 onwards,  SASI would allow
operators like <, > and multiple columns indexing) - clearly better than
allow filtering - but still the whole cluster would be contacted most of
the times for your queries. Maybe combining Cassandra SASI with Spark data
locality could solve this better. But first you could try with SASI and see
the query performance.

Valentina


On Mon, Oct 9, 2017 at 7:56 PM, Avi Levi <av...@indeni.com> wrote:

> Thanks Crisan .
> I understand what you're saying. But according to your suggestion I will
> have a record for every entry while I am interested only on the last entry
> . So the proposed solution is actually keeping much more data then needed .
>
> On Oct 9, 2017 8:40 PM, "Valentina Crisan" <va...@gmail.com>
> wrote:
>
> Allow filtering is almost never the answer, especially when you want to do
> a full table scan ( there might be some cases where the query is limited to
> a partition and allow filtering could be used). And you would like to run
> this query every minute - thus extremely good performance is required.
> Allow filtering basically brings locally in your coordinator the whole
> table content and performs local filtering of the data before answering
> your query. Performance wise is not recommended to use such an
> implementation.
>
> For a query running every minute you need to address it in one partition
> read (according to Cassandra data modeling rules) and that can be done with
> denormalization ( manually or materialized views). As far as I know and
> also from the discussions in this list MV should be used still with caution
> in production environments. Thus, the best option in my opinion is manual
> denormalization of data, building a table with partition key last_seen and
> clustering key username and adding/updating data accordingly. Furthermore
> last_seen I understand it's a value of any time/hour of day - you could
> consider building partitions per day: partition key  = (last_seen, day),
> primary key = ((last_seen,day),username)).
>
> Valentina
>
> On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi <av...@indeni.com> wrote:
>
>> Hi
>>
>> I have the following table:
>>
>> CREATE TABLE users (
>>     username text,
>>     last_seen bigint,
>>     PRIMARY KEY (username)
>> );
>>
>> where* last_seen* is basically the writetime . Number of records in the
>> table is aprox 10 million. Insert is pretty much straightforward insert
>> into users (username, last_seen) VALUES ([username], now)
>>
>> I want to make some processing on users that were not seen for the past
>> XXX (where xxx can be hours/days ... ) by query the last_seen column
>> (this query runs every minute) e.g :
>>
>> select username from users where last_seen < (now - 1 day).
>>
>> I have two options as I see it:
>>
>>    1. use materialized view :
>>
>> CREATE MATERIALIZED VIEW users_last_seen AS
>> SELECT last_seen, username
>> FROM users
>> WHERE last_seen IS NOT NULL
>> PRIMARY KEY (last_seen, username);
>>
>>
>> and simply query:
>>
>> select username from users_last_seen where last_seen < (now - 1 day)
>>
>>    1.
>>
>>    query the users table
>>
>>    select username from users where last_seen < (now - 1 day) ALLOW
>>    FILTERING
>>
>> which one is more efficient? any other options ?
>>
>> Any help will be greatly appreciated
>>
>> Best
>>
>> Avi
>>
>
>
>

Re: Using materialized view or AllowFiltering which one is better ?

Posted by Avi Levi <av...@indeni.com>.
Thanks Crisan .
I understand what you're saying. But according to your suggestion I will
have a record for every entry while I am interested only on the last entry
. So the proposed solution is actually keeping much more data then needed .

On Oct 9, 2017 8:40 PM, "Valentina Crisan" <va...@gmail.com>
wrote:

Allow filtering is almost never the answer, especially when you want to do
a full table scan ( there might be some cases where the query is limited to
a partition and allow filtering could be used). And you would like to run
this query every minute - thus extremely good performance is required.
Allow filtering basically brings locally in your coordinator the whole
table content and performs local filtering of the data before answering
your query. Performance wise is not recommended to use such an
implementation.

For a query running every minute you need to address it in one partition
read (according to Cassandra data modeling rules) and that can be done with
denormalization ( manually or materialized views). As far as I know and
also from the discussions in this list MV should be used still with caution
in production environments. Thus, the best option in my opinion is manual
denormalization of data, building a table with partition key last_seen and
clustering key username and adding/updating data accordingly. Furthermore
last_seen I understand it's a value of any time/hour of day - you could
consider building partitions per day: partition key  = (last_seen, day),
primary key = ((last_seen,day),username)).

Valentina

On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi <av...@indeni.com> wrote:

> Hi
>
> I have the following table:
>
> CREATE TABLE users (
>     username text,
>     last_seen bigint,
>     PRIMARY KEY (username)
> );
>
> where* last_seen* is basically the writetime . Number of records in the
> table is aprox 10 million. Insert is pretty much straightforward insert
> into users (username, last_seen) VALUES ([username], now)
>
> I want to make some processing on users that were not seen for the past
> XXX (where xxx can be hours/days ... ) by query the last_seen column
> (this query runs every minute) e.g :
>
> select username from users where last_seen < (now - 1 day).
>
> I have two options as I see it:
>
>    1. use materialized view :
>
> CREATE MATERIALIZED VIEW users_last_seen AS
> SELECT last_seen, username
> FROM users
> WHERE last_seen IS NOT NULL
> PRIMARY KEY (last_seen, username);
>
>
> and simply query:
>
> select username from users_last_seen where last_seen < (now - 1 day)
>
>    1.
>
>    query the users table
>
>    select username from users where last_seen < (now - 1 day) ALLOW
>    FILTERING
>
> which one is more efficient? any other options ?
>
> Any help will be greatly appreciated
>
> Best
>
> Avi
>

Re: Using materialized view or AllowFiltering which one is better ?

Posted by Valentina Crisan <va...@gmail.com>.
Allow filtering is almost never the answer, especially when you want to do
a full table scan ( there might be some cases where the query is limited to
a partition and allow filtering could be used). And you would like to run
this query every minute - thus extremely good performance is required.
Allow filtering basically brings locally in your coordinator the whole
table content and performs local filtering of the data before answering
your query. Performance wise is not recommended to use such an
implementation.

For a query running every minute you need to address it in one partition
read (according to Cassandra data modeling rules) and that can be done with
denormalization ( manually or materialized views). As far as I know and
also from the discussions in this list MV should be used still with caution
in production environments. Thus, the best option in my opinion is manual
denormalization of data, building a table with partition key last_seen and
clustering key username and adding/updating data accordingly. Furthermore
last_seen I understand it's a value of any time/hour of day - you could
consider building partitions per day: partition key  = (last_seen, day),
primary key = ((last_seen,day),username)).

Valentina

On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi <av...@indeni.com> wrote:

> Hi
>
> I have the following table:
>
> CREATE TABLE users (
>     username text,
>     last_seen bigint,
>     PRIMARY KEY (username)
> );
>
> where* last_seen* is basically the writetime . Number of records in the
> table is aprox 10 million. Insert is pretty much straightforward insert
> into users (username, last_seen) VALUES ([username], now)
>
> I want to make some processing on users that were not seen for the past
> XXX (where xxx can be hours/days ... ) by query the last_seen column
> (this query runs every minute) e.g :
>
> select username from users where last_seen < (now - 1 day).
>
> I have two options as I see it:
>
>    1. use materialized view :
>
> CREATE MATERIALIZED VIEW users_last_seen AS
> SELECT last_seen, username
> FROM users
> WHERE last_seen IS NOT NULL
> PRIMARY KEY (last_seen, username);
>
>
> and simply query:
>
> select username from users_last_seen where last_seen < (now - 1 day)
>
>    1.
>
>    query the users table
>
>    select username from users where last_seen < (now - 1 day) ALLOW
>    FILTERING
>
> which one is more efficient? any other options ?
>
> Any help will be greatly appreciated
>
> Best
>
> Avi
>