Posted to user@cassandra.apache.org by SE...@homedepot.com on 2016/01/22 16:27:53 UTC

RE: Using TTL for data purge

An upsert is simply a second insert. Cassandra’s sstables are immutable, so there are no real “overwrites” of the data on disk; the new write is another record/row. On read it behaves like an overwrite, because Cassandra reads both inserts and takes the most recently written one as the correct data. That works for changing the TTL (and anything else that changes in the data).
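
For illustration, here is a minimal CQL sketch of that behavior (table and column names are made up for the example); each re-insert carries its own TTL, and the newest write wins on read:

CREATE TABLE user_activity (
    user_id   text PRIMARY KEY,
    last_seen timestamp
);

-- First write, expiring in 90 days (90 * 86400 = 7776000 seconds)
INSERT INTO user_activity (user_id, last_seen)
VALUES ('u123', '2015-12-01') USING TTL 7776000;

-- Later "upsert": a brand-new write with a fresh TTL. On read it shadows
-- the older cell; the older cell is dropped whenever the sstables holding
-- the two writes are eventually compacted together.
INSERT INTO user_activity (user_id, last_seen)
VALUES ('u123', '2015-12-30') USING TTL 7776000;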

Compaction creates a new sstable from existing ones. If both inserts are in the sstables being compacted, only the latest data is written out, so the older insert is effectively dropped from the new sstable on disk.

As I understand TTL, if there is a compaction of a cell (or row) with a TTL that has been reached, a tombstone will be written.

Sean Durity – Lead Cassandra Admin
Big DATA Team
For support, create a JIRA: https://portal.homedepot.com/sites/bigdata/Shared%20Documents/Jira%20Hadoop%20Support%20Workflow.pdf

From: Joseph TechMails [mailto:jaalex.tech@gmail.com]
Sent: Wednesday, December 30, 2015 3:59 AM
To: user@cassandra.apache.org
Subject: Re: Using TTL for data purge

Thanks, Sean. Our use case is to delete records after a few months of inactivity. That period is fixed, but the TTL can get reset if the record is accessed within that timeframe - similar to extending a session. All reads are done by key, and there would be multiple upserts (all columns are re-INSERTed, including the TTL) while a record is active, so it's not exactly write-once/read-many. Are there any overheads for processes like compaction due to this overwriting of the TTL? I guess reads won't be affected, since they are always done by key and won't have to filter out tombstones.
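
As a quick way to check that a re-insert really did reset the clock (a sketch, reusing the hypothetical user_activity table above), CQL's TTL() function reports the seconds remaining on a cell:

SELECT user_id, last_seen, TTL(last_seen)
FROM user_activity
WHERE user_id = 'u123';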

Regarding the data size, I could see a small decrease in the disk usage (du) of the "data" directory immediately after the rows with TTL expired, and a further reduction after running compaction on the CF (though this wasn't always reproducible). Since the tombstones should ideally stay for 10 days, I assume this observation is not related to data expiry. Please confirm.

Thanks,
Joseph


On Tue, Dec 29, 2015 at 11:20 PM, <SE...@homedepot.com> wrote:
If you know how long the records should last, TTL is a good way to go. Remember that neither TTL nor deletes are immediate purge strategies. Each inserts a special record called a tombstone to indicate a deleted record. After compaction (that is, after gc_grace_seconds for the table, default 10 days), the data will be removed and you will regain disk space.
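
For reference, gc_grace_seconds is a per-table setting, so the 10-day default can be tuned; a minimal sketch (hypothetical table name, and only safe if repairs run more often than the new window):

-- Default is 864000 seconds (10 days); this drops it to 5 days.
ALTER TABLE user_activity WITH gc_grace_seconds = 432000;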

If the data is relatively volatile and read speeds are important, you might look at leveled compaction, though it can keep your nodes a bit busier than size-tiered. (An issue with size-tiered, over time, is that the tombstoned data in the larger and older sstables may rarely, if ever, get compacted out.)
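
Switching an existing table to leveled compaction is a one-line schema change; a sketch, again with a hypothetical table name:

ALTER TABLE user_activity
WITH compaction = { 'class' : 'LeveledCompactionStrategy' };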


Sean Durity – Lead Cassandra Admin
From: jaalex.tech [mailto:jaalex.tech@gmail.com]
Sent: Tuesday, December 22, 2015 4:36 AM
To: user@cassandra.apache.org
Subject: Using TTL for data purge

Hi,

I'm looking for suggestions/caveats on using TTL as a substitute for a manual data purge job.

We have a few tables that hold user information - these could be guest or registered users, and there could be between 500K and 1M records created per day per table. Currently, these tables have a secondary-indexed updated_date column which is populated on each update. However, we have been getting timeouts when running queries on updated_date when the number of records is high, so I don't think this would be a reliable option in the long term when we need to purge records that have not been used for the last X days.

In this scenario, is it advisable to include a high enough TTL (i.e. the amount of time we want these to last, which could be 3 to 6 months) when inserting/updating records?
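
Something like the following is what I have in mind - a rough sketch with hypothetical names, using roughly 6 months (180 * 86400 = 15552000 seconds); I believe a table-level default_time_to_live is also available from 2.0:

-- TTL supplied on each write
INSERT INTO user_profile (user_id, email)
VALUES ('u123', 'someone@example.com') USING TTL 15552000;

-- Or a table-wide default, applied to writes that don't specify their own TTL
ALTER TABLE user_profile WITH default_time_to_live = 15552000;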

There could be cases where the TTL gets reset after a couple of days/weeks, when the user visits the site again.

The tables have a fixed number of columns, except for one which has a clustering key and may have at most 10 entries per partition key.

I need to know the overhead of having so many rows with TTLs hanging around for a relatively long duration (weeks/months), and the impact this could have on performance/storage. If this is not a recommended approach, what would be an alternate design for a manual purge job that doesn't use secondary indices?

We are using Cassandra 2.0.x.

Thanks,
Joseph



RE: Using TTL for data purge

Posted by SE...@homedepot.com.
Thanks, I appreciate the correction to my understanding.


Sean Durity

From: Jeff Jirsa [mailto:jeff.jirsa@crowdstrike.com]
Sent: Friday, January 22, 2016 1:04 PM
To: user@cassandra.apache.org
Subject: Re: Using TTL for data purge

"As I understand TTL, if there is a compaction of a cell (or row) with a TTL that has been reached, a tombstone will be written.”

The expiring cell is treated as a tombstone once it reaches it’s end of life, it does not write an additional tombstone to disk.


From: "SEAN_R_DURITY@homedepot.com<ma...@homedepot.com>"
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Date: Friday, January 22, 2016 at 7:27 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
Subject: RE: Using TTL for data purge

An upsert is a second insert. Cassandra’s sstables are immutable. There are no real “overwrites” (of the data on disk). It is another record/row. Upon read, it acts like an overwrite, because Cassandra will read both inserts and take the last one in as the correct data. This strategy will work for changing the TTL (and anything else that changes in the data).

Compaction creates a new sstable from existing ones. It will (if the inserts are in the compacted sstables) write only the latest data, so the older insert is effectively deleted/dropped from the new sstable now on disk.

As I understand TTL, if there is a compaction of a cell (or row) with a TTL that has been reached, a tombstone will be written.

Sean Durity – Lead Cassandra Admin
Big DATA Team
For support, create a JIRA<https://portal.homedepot.com/sites/bigdata/Shared%20Documents/Jira%20Hadoop%20Support%20Workflow.pdf>

From: Joseph TechMails [mailto:jaalex.tech@gmail.com]
Sent: Wednesday, December 30, 2015 3:59 AM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Using TTL for data purge

Thanks, Sean. Our usecase is to delete records after few months of inactivity, and that period is fixed, but the TTL could get reset if the record is accessed within that timeframe - similar to extending a session. All reads are done based on the key, and there would be multiple upserts (all columns are re-INSERTed, including TTL) while it's active, so it's not exactly write-once/read-many. Are there any overheads for processes like compaction due to this overwriting of TTL? . I guess reads won't be affected since it's always done with the key, and won't have to filter out tombstones.

Regarding the data size, i could see a small decrease in the disk usage (du) of the "data" directory immediately after the rows with TTL expired, and still further reduction after running compaction on the CF (though this wasn't replicable always). Since the tombstones should ideally stay for 10 days, i assume this observation is not related to data expiry. Please confirm

Thanks,
Joseph


On Tue, Dec 29, 2015 at 11:20 PM, <SE...@homedepot.com>> wrote:
If you know how long the records should last, TTL is a good way to go. Remember that neither TTL or deletes are right-away purge strategies. Each inserts a special record called a tombstone to indicate a deleted record. After compaction (that is after gc_grace_seconds for the table, default 10 days), the data will be removed and you will regain disk space.

If the data is relatively volatile and read speeds are important, you might look at leveled compaction, though it can keep your nodes a bit busier than size-tiered. (An issue with size-tiered, over time, is that the tombstoned data in the larger and older sstables may rarely, if ever, get compacted out.)


Sean Durity – Lead Cassandra Admin
From: jaalex.tech [mailto:jaalex.tech@gmail.com<ma...@gmail.com>]
Sent: Tuesday, December 22, 2015 4:36 AM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Using TTL for data purge

Hi,

I'm looking for suggestions/caveats on using TTL as a subsitute for a manual data purge job.

We have few tables that hold user information - this could be guest or registered users, and there could be between 500K to 1M records created per day per table. Currently, these tables have a secondary indexed updated_date column which is populated on each update. However, we have been getting timeouts when running queries using updated_date when the number of records are high, so i don't think this would be a reliable option in the long term when we need to purge records that have not been used for the last X days.

In this scenario, is it advisable to include a high enough TTL (i.e the amount of time we want these to last, could be 3 to 6 months) when inserting/updating records?

There could be cases where the TTL may get reset after couple of days/weeks, when the user visits the site again.

The tables have fixed number of columns, except for one which has a clustering key, and may have max 10 entries per  partition key.

I need to know the overhead of having so many rows with TTL hanging around for a relatively longer duration (weeks/months), and the impacts it could have on performance/storage. If this is not a recommended approach, what would be an alternate design which could be used for a manual purge job, without using secondary indices.

We are using Cassandra 2.0.x.

Thanks,
Joseph


________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.


________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

Re: Using TTL for data purge

Posted by Jeff Jirsa <je...@crowdstrike.com>.
"As I understand TTL, if there is a compaction of a cell (or row) with a TTL that has been reached, a tombstone will be written.”

The expiring cell is treated as a tombstone once it reaches it’s end of life, it does not write an additional tombstone to disk.



From:  "SEAN_R_DURITY@homedepot.com"
Reply-To:  "user@cassandra.apache.org"
Date:  Friday, January 22, 2016 at 7:27 AM
To:  "user@cassandra.apache.org"
Subject:  RE: Using TTL for data purge
