You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Wim Deblauwe <wi...@gmail.com> on 2014/06/30 09:59:13 UTC

Best way to delete by day?

Hi,

I am getting started with Cassandra (coming from MySQL). I have made a
table with timeseries data (inspired on
http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/
).

The table looks like this:

CREATE TABLE event_message (
message_id uuid,
message_source_id uuid,
message_time timestamp,
event_type_id varchar,
event_state varchar,
filter_state varchar,
image_id uuid,
device_specific_id bigint,
device_specific_begin_id bigint,
characteristics varchar,
PRIMARY KEY (message_source_id, message_time, message_id)
);

I have now 2 requirements:
1) I need to remove rows after a certain (user settable) time (between 5
and 60 days). In MySQL, we used partitions by day to quickly delete a whole
day.
2) I need to store a big binary file along with each row and this file
should be removed when the row is removed.

I was looking into the expiring columns (with the TTL), but is this a good
fit for this use case? Is this TTL stored between restarts of Cassandra?

Would there be any advantage to use the system called "Partitioning to
limit row size – Time Series Pattern 2" in the URL and then explicitly
doing a delete of a whole day? With this system, if I query by time, do I
need to calculate what days are in the interval and explicitly add this in
my query to find the good partitions?

How can I get notifications if a row is expired when using TTL so I can
removed the associated file?

regards,

Wim

Re: Best way to delete by day?

Posted by DuyHai Doan <do...@gmail.com>.
It is recommended to split your big binary data into small chunks of
5Mb/10Mb if you want to store them in C*.  Astyanax framework can help with
this.

If you store binary files outside C*, yes, you'll need to manage the
deletion manually, at the same time you delete the partition for a day of
data in C*


On Mon, Jun 30, 2014 at 10:28 AM, Wim Deblauwe <wi...@gmail.com>
wrote:

> Hi,
>
> Thanks for the answers.
>
> Are you saying that I could store big binary files in Cassandra ? I have
> read somewhere that if the file is more than 10 Mb, it is probably not such
> a good idea? The binary files can be up to 50 or 100 Mb, no more in my case.
>
> So the way I understand it, if I store the binary file outside of
> Cassandra, I need to delete manually and go with strategy 2 since there are
> no notifications.
>
> regards,
>
> Wim
>
>
> 2014-06-30 10:23 GMT+02:00 DuyHai Doan <do...@gmail.com>:
>
> Hello Wim
>>
>> TTL is a good fit for your requirement if you want Cassandra to handle
>> the deletion task for you.
>>
>> Now, clearly there are 2 strategies:
>>
>> 1) Store data on the same partition (physical row) and set TTL to expire
>> data automatically
>> 2) Store data on several partitions, one for each day for example, and
>> manage deletion manually or use TTL again
>>
>> If you have few data, strategy 1 is fine. If your data is huge and/or you
>> need to reclaim disk space quickly (especially with the big binary file),
>> you'll probably better off choosing strategy 2. The only drawback with
>> strategy 2 is when you need querying data that span over several days,
>> you'll have to issue many queries (one for each distinct day) or use the
>> "IN" clause of CQL3 but this has a small performance overhead since.
>>
>> Do not forget to set gc_grace_seconds to 0 to have data removed quickly.
>>
>> About notification, it's not possible right now to be notified on the
>> client side when an expiring column (column with TTL) is physically removed
>> by Cassandra
>>
>>
>>
>>
>> On Mon, Jun 30, 2014 at 9:59 AM, Wim Deblauwe <wi...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am getting started with Cassandra (coming from MySQL). I have made a
>>> table with timeseries data (inspired on
>>> http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/
>>> ).
>>>
>>> The table looks like this:
>>>
>>> CREATE TABLE event_message (
>>> message_id uuid,
>>> message_source_id uuid,
>>> message_time timestamp,
>>> event_type_id varchar,
>>> event_state varchar,
>>> filter_state varchar,
>>> image_id uuid,
>>> device_specific_id bigint,
>>> device_specific_begin_id bigint,
>>> characteristics varchar,
>>> PRIMARY KEY (message_source_id, message_time, message_id)
>>> );
>>>
>>> I have now 2 requirements:
>>> 1) I need to remove rows after a certain (user settable) time (between 5
>>> and 60 days). In MySQL, we used partitions by day to quickly delete a whole
>>> day.
>>> 2) I need to store a big binary file along with each row and this file
>>> should be removed when the row is removed.
>>>
>>> I was looking into the expiring columns (with the TTL), but is this a
>>> good fit for this use case? Is this TTL stored between restarts of
>>> Cassandra?
>>>
>>> Would there be any advantage to use the system called "Partitioning to
>>> limit row size – Time Series Pattern 2" in the URL and then explicitly
>>> doing a delete of a whole day? With this system, if I query by time, do I
>>> need to calculate what days are in the interval and explicitly add this in
>>> my query to find the good partitions?
>>>
>>> How can I get notifications if a row is expired when using TTL so I can
>>> removed the associated file?
>>>
>>> regards,
>>>
>>> Wim
>>>
>>
>>
>

Re: Best way to delete by day?

Posted by Wim Deblauwe <wi...@gmail.com>.
Hi,

Thanks for the answers.

Are you saying that I could store big binary files in Cassandra ? I have
read somewhere that if the file is more than 10 Mb, it is probably not such
a good idea? The binary files can be up to 50 or 100 Mb, no more in my case.

So the way I understand it, if I store the binary file outside of
Cassandra, I need to delete manually and go with strategy 2 since there are
no notifications.

regards,

Wim


2014-06-30 10:23 GMT+02:00 DuyHai Doan <do...@gmail.com>:

> Hello Wim
>
> TTL is a good fit for your requirement if you want Cassandra to handle the
> deletion task for you.
>
> Now, clearly there are 2 strategies:
>
> 1) Store data on the same partition (physical row) and set TTL to expire
> data automatically
> 2) Store data on several partitions, one for each day for example, and
> manage deletion manually or use TTL again
>
> If you have few data, strategy 1 is fine. If your data is huge and/or you
> need to reclaim disk space quickly (especially with the big binary file),
> you'll probably better off choosing strategy 2. The only drawback with
> strategy 2 is when you need querying data that span over several days,
> you'll have to issue many queries (one for each distinct day) or use the
> "IN" clause of CQL3 but this has a small performance overhead since.
>
> Do not forget to set gc_grace_seconds to 0 to have data removed quickly.
>
> About notification, it's not possible right now to be notified on the
> client side when an expiring column (column with TTL) is physically removed
> by Cassandra
>
>
>
>
> On Mon, Jun 30, 2014 at 9:59 AM, Wim Deblauwe <wi...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am getting started with Cassandra (coming from MySQL). I have made a
>> table with timeseries data (inspired on
>> http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/
>> ).
>>
>> The table looks like this:
>>
>> CREATE TABLE event_message (
>> message_id uuid,
>> message_source_id uuid,
>> message_time timestamp,
>> event_type_id varchar,
>> event_state varchar,
>> filter_state varchar,
>> image_id uuid,
>> device_specific_id bigint,
>> device_specific_begin_id bigint,
>> characteristics varchar,
>> PRIMARY KEY (message_source_id, message_time, message_id)
>> );
>>
>> I have now 2 requirements:
>> 1) I need to remove rows after a certain (user settable) time (between 5
>> and 60 days). In MySQL, we used partitions by day to quickly delete a whole
>> day.
>> 2) I need to store a big binary file along with each row and this file
>> should be removed when the row is removed.
>>
>> I was looking into the expiring columns (with the TTL), but is this a
>> good fit for this use case? Is this TTL stored between restarts of
>> Cassandra?
>>
>> Would there be any advantage to use the system called "Partitioning to
>> limit row size – Time Series Pattern 2" in the URL and then explicitly
>> doing a delete of a whole day? With this system, if I query by time, do I
>> need to calculate what days are in the interval and explicitly add this in
>> my query to find the good partitions?
>>
>> How can I get notifications if a row is expired when using TTL so I can
>> removed the associated file?
>>
>> regards,
>>
>> Wim
>>
>
>

Re: Best way to delete by day?

Posted by DuyHai Doan <do...@gmail.com>.
Hello Wim

TTL is a good fit for your requirement if you want Cassandra to handle the
deletion task for you.

Now, clearly there are 2 strategies:

1) Store data on the same partition (physical row) and set TTL to expire
data automatically
2) Store data on several partitions, one for each day for example, and
manage deletion manually or use TTL again

If you have few data, strategy 1 is fine. If your data is huge and/or you
need to reclaim disk space quickly (especially with the big binary file),
you'll probably better off choosing strategy 2. The only drawback with
strategy 2 is when you need querying data that span over several days,
you'll have to issue many queries (one for each distinct day) or use the
"IN" clause of CQL3 but this has a small performance overhead since.

Do not forget to set gc_grace_seconds to 0 to have data removed quickly.

About notification, it's not possible right now to be notified on the
client side when an expiring column (column with TTL) is physically removed
by Cassandra




On Mon, Jun 30, 2014 at 9:59 AM, Wim Deblauwe <wi...@gmail.com>
wrote:

> Hi,
>
> I am getting started with Cassandra (coming from MySQL). I have made a
> table with timeseries data (inspired on
> http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/
> ).
>
> The table looks like this:
>
> CREATE TABLE event_message (
> message_id uuid,
> message_source_id uuid,
> message_time timestamp,
> event_type_id varchar,
> event_state varchar,
> filter_state varchar,
> image_id uuid,
> device_specific_id bigint,
> device_specific_begin_id bigint,
> characteristics varchar,
> PRIMARY KEY (message_source_id, message_time, message_id)
> );
>
> I have now 2 requirements:
> 1) I need to remove rows after a certain (user settable) time (between 5
> and 60 days). In MySQL, we used partitions by day to quickly delete a whole
> day.
> 2) I need to store a big binary file along with each row and this file
> should be removed when the row is removed.
>
> I was looking into the expiring columns (with the TTL), but is this a good
> fit for this use case? Is this TTL stored between restarts of Cassandra?
>
> Would there be any advantage to use the system called "Partitioning to
> limit row size – Time Series Pattern 2" in the URL and then explicitly
> doing a delete of a whole day? With this system, if I query by time, do I
> need to calculate what days are in the interval and explicitly add this in
> my query to find the good partitions?
>
> How can I get notifications if a row is expired when using TTL so I can
> removed the associated file?
>
> regards,
>
> Wim
>