Posted to user@cassandra.apache.org by Jeffrey Wang <jw...@palantir.com> on 2011/02/03 06:08:00 UTC

rolling window of data

Hi,

We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. last 90 days). We use the timestamp of the log entries as the column names so we can do time range queries. Everything seems to be working fine, but it's not clear if there is an efficient way to delete data that is more than 90 days old.

Originally I thought that using a slice range on a deletion would do the trick, but that apparently is not supported yet. Another idea I had was to store the timestamp of the log entry as Cassandra's timestamp and pass in artificial timestamps to remove (thrift API), but that seems hacky. Does anyone know if there is a good way to support this kind of rolling window of data efficiently? Thanks.

-Jeffrey
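[Editor's note: the layout Jeffrey describes — log-entry timestamps as column names, sorted within a row so that time-range queries are cheap slices — can be modeled in plain Python. This is an illustrative sketch only; the class and method names are made up and do not reflect the Thrift API.]

```python
import bisect

# Toy model of one Cassandra row: column names are log-entry
# timestamps, kept sorted, so a time-range query is a list slice.
class LogRow:
    def __init__(self):
        self.names = []   # sorted column names (entry timestamps)
        self.values = {}  # column name -> log line

    def insert(self, ts, line):
        if ts not in self.values:
            bisect.insort(self.names, ts)
        self.values[ts] = line

    def slice(self, start, finish):
        # Rough equivalent of a get_slice with a SliceRange [start, finish]
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, finish)
        return [(ts, self.values[ts]) for ts in self.names[lo:hi]]

row = LogRow()
for ts in (100, 200, 300, 400):
    row.insert(ts, "entry@%d" % ts)
print(row.slice(150, 350))  # -> [(200, 'entry@200'), (300, 'entry@300')]
```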


Re: rolling window of data

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Feb 3, 2011 at 3:59 PM, Jeffrey Wang <jw...@palantir.com> wrote:
> To be a little more clear, a simplified version of what I'm asking is:
>
> Let's say you add 1K columns with timestamps 1 to 1000. Then, at an arbitrarily distant point in the future, if you call remove on that CF with timestamp 500 (so the timestamps are logically out of order), will it delete exactly half of it

Yes.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
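[Editor's note: the behavior Jonathan confirms follows from Cassandra's timestamp-based reconciliation: a deletion carries a timestamp, and it shadows every column written at or before that timestamp. A toy Python model of just that rule (my sketch, not Cassandra code):]

```python
# Toy model of timestamp-based reconciliation: a remove with
# timestamp T shadows every column whose write timestamp is <= T.
def apply_remove(columns, tombstone_ts):
    # columns: dict of column_name -> write_timestamp
    return {name: ts for name, ts in columns.items() if ts > tombstone_ts}

cols = {i: i for i in range(1, 1001)}   # 1K columns, timestamps 1..1000
survivors = apply_remove(cols, 500)
print(len(survivors))  # -> 500 (exactly half; timestamps 501..1000 remain)
```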

RE: rolling window of data

Posted by Jeffrey Wang <jw...@palantir.com>.
To be a little more clear, a simplified version of what I'm asking is:

Let's say you add 1K columns with timestamps 1 to 1000. Then, at an arbitrarily distant point in the future, if you call remove on that CF with timestamp 500 (so the timestamps are logically out of order), will it delete exactly half of it, or is there stuff that might go on under the covers that makes this not work as you might expect?

-Jeffrey

-----Original Message-----
From: Jeffrey Wang [mailto:jwang@palantir.com] 
Sent: Thursday, February 03, 2011 3:03 PM
To: user@cassandra.apache.org
Subject: RE: rolling window of data

Thanks for the response, but unfortunately a TTL is not enough for us. We would like to be able to dynamically control the window in case there is an unusually large amount of data or something so we don't run out of disk space.

One question I have in particular is: if I use the timestamp of my log entries (not necessarily correlated at all with the timestamp of insert) as the timestamp on my mutations will Cassandra do the right thing when I delete? We don't have any need for conflict resolution, so we are currently just using the current time.

It seems like there is a possibility, depending on the implementation details of Cassandra, that I could call a remove with a timestamp for which everything before that should get deleted. Like I said before, this seems a bit hacky to me, but would it get the job done?

-Jeffrey

-----Original Message-----
From: scode@scode.org [mailto:scode@scode.org] On Behalf Of Peter Schuller
Sent: Thursday, February 03, 2011 8:48 AM
To: user@cassandra.apache.org
Subject: Re: rolling window of data

> The correct way to accomplish what you describe is the new (in 0.7)
> per-column TTL.  Simply set this to 60 * 60 * 24 * 90 (90 days' worth of
> seconds) and your columns will magically disappear after that length of
> time.

Although that assumes it's okay to lose data, or that there is some
other mechanism in place to prevent loss should the data not yet be
processed to whatever extent is required.

TTLs would be a great way to efficiently achieve the windowing, but
they do remove the ability to explicitly control exactly when data is
removed (such as after certain batch processing of it has completed).

-- 
/ Peter Schuller

RE: rolling window of data

Posted by Jeffrey Wang <jw...@palantir.com>.
Thanks for the response, but unfortunately a TTL is not enough for us. We would like to be able to dynamically control the window in case there is an unusually large amount of data or something so we don't run out of disk space.

One question I have in particular is: if I use the timestamp of my log entries (not necessarily correlated at all with the timestamp of insert) as the timestamp on my mutations will Cassandra do the right thing when I delete? We don't have any need for conflict resolution, so we are currently just using the current time.

It seems like there is a possibility, depending on the implementation details of Cassandra, that I could call a remove with a timestamp for which everything before that should get deleted. Like I said before, this seems a bit hacky to me, but would it get the job done?

-Jeffrey

-----Original Message-----
From: scode@scode.org [mailto:scode@scode.org] On Behalf Of Peter Schuller
Sent: Thursday, February 03, 2011 8:48 AM
To: user@cassandra.apache.org
Subject: Re: rolling window of data

> The correct way to accomplish what you describe is the new (in 0.7)
> per-column TTL.  Simply set this to 60 * 60 * 24 * 90 (90 days' worth of
> seconds) and your columns will magically disappear after that length of
> time.

Although that assumes it's okay to lose data, or that there is some
other mechanism in place to prevent loss should the data not yet be
processed to whatever extent is required.

TTLs would be a great way to efficiently achieve the windowing, but
they do remove the ability to explicitly control exactly when data is
removed (such as after certain batch processing of it has completed).

-- 
/ Peter Schuller

Re: rolling window of data

Posted by Peter Schuller <pe...@infidyne.com>.
> The correct way to accomplish what you describe is the new (in 0.7)
> per-column TTL.  Simply set this to 60 * 60 * 24 * 90 (90 days' worth of
> seconds) and your columns will magically disappear after that length of
> time.

Although that assumes it's okay to lose data, or that there is some
other mechanism in place to prevent loss should the data not yet be
processed to whatever extent is required.

TTLs would be a great way to efficiently achieve the windowing, but
they do remove the ability to explicitly control exactly when data is
removed (such as after certain batch processing of it has completed).

-- 
/ Peter Schuller

Re: rolling window of data

Posted by Tyler Hobbs <ty...@datastax.com>.
No, Logsandra does not use a rolling window.

The correct way to accomplish what you describe is the new (in 0.7)
per-column TTL.  Simply set this to 60 * 60 * 24 * 90 (90 days' worth of
seconds) and your columns will magically disappear after that length of
time.

- Tyler
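[Editor's note: Tyler's arithmetic and the expiry rule can be sketched as follows. The `is_live` helper is illustrative only, not a pycassa/Thrift call; Cassandra stops returning an expired column once its TTL has elapsed and purges it at compaction.]

```python
TTL_90_DAYS = 60 * 60 * 24 * 90   # seconds in 90 days
print(TTL_90_DAYS)  # -> 7776000

# A column written at time `written_at` with this TTL is visible
# only while now < written_at + ttl.
def is_live(written_at, ttl, now):
    return now < written_at + ttl

assert is_live(0, TTL_90_DAYS, 7775999)       # one second before expiry
assert not is_live(0, TTL_90_DAYS, 7776000)   # at expiry, column is gone
```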

On Wed, Feb 2, 2011 at 11:46 PM, Jeffrey Wang <jw...@palantir.com> wrote:

> Thanks for the link, but unfortunately it doesn't look like it uses a
> rolling window. As far as I can tell, log entries just keep getting
> inserted into Cassandra.
>
> -Jeffrey
>
> From: Aaron Morton [mailto:aaron@thelastpickle.com]
> Sent: Wednesday, February 02, 2011 9:21 PM
> To: user@cassandra.apache.org
> Subject: Re: rolling window of data
>
> This project may provide some inspiration for you
> https://github.com/thobbs/logsandra
>
> Not sure if it has a rolling window, if you find out let me know :)
>
> Aaron
>
> On 03 Feb, 2011, at 06:08 PM, Jeffrey Wang <jw...@palantir.com> wrote:
>
> Hi,
>
> We're trying to use Cassandra 0.7 to store a rolling window of log data
> (e.g. last 90 days). We use the timestamp of the log entries as the column
> names so we can do time range queries. Everything seems to be working fine,
> but it's not clear if there is an efficient way to delete data that is more
> than 90 days old.
>
> Originally I thought that using a slice range on a deletion would do the
> trick, but that apparently is not supported yet. Another idea I had was to
> store the timestamp of the log entry as Cassandra's timestamp and pass in
> artificial timestamps to remove (thrift API), but that seems hacky. Does
> anyone know if there is a good way to support this kind of rolling window of
> data efficiently? Thanks.
>
> -Jeffrey

RE: rolling window of data

Posted by Jeffrey Wang <jw...@palantir.com>.
Thanks for the link, but unfortunately it doesn't look like it uses a rolling window. As far as I can tell, log entries just keep getting inserted into Cassandra.

-Jeffrey

From: Aaron Morton [mailto:aaron@thelastpickle.com]
Sent: Wednesday, February 02, 2011 9:21 PM
To: user@cassandra.apache.org
Subject: Re: rolling window of data

This project may provide some inspiration for you https://github.com/thobbs/logsandra

Not sure if it has a rolling window, if you find out let me know :)

Aaron


On 03 Feb, 2011, at 06:08 PM, Jeffrey Wang <jw...@palantir.com> wrote:
Hi,

We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. last 90 days). We use the timestamp of the log entries as the column names so we can do time range queries. Everything seems to be working fine, but it's not clear if there is an efficient way to delete data that is more than 90 days old.

Originally I thought that using a slice range on a deletion would do the trick, but that apparently is not supported yet. Another idea I had was to store the timestamp of the log entry as Cassandra's timestamp and pass in artificial timestamps to remove (thrift API), but that seems hacky. Does anyone know if there is a good way to support this kind of rolling window of data efficiently? Thanks.

-Jeffrey


Re: rolling window of data

Posted by Aaron Morton <aa...@thelastpickle.com>.
This project may provide some inspiration for you https://github.com/thobbs/logsandra

Not sure if it has a rolling window, if you find out let me know :) 

Aaron


On 03 Feb, 2011, at 06:08 PM, Jeffrey Wang <jw...@palantir.com> wrote:

Hi,
 
We’re trying to use Cassandra 0.7 to store a rolling window of log data (e.g. last 90 days). We use the timestamp of the log entries as the column names so we can do time range queries. Everything seems to be working fine, but it’s not clear if there is an efficient way to delete data that is more than 90 days old.
 
Originally I thought that using a slice range on a deletion would do the trick, but that apparently is not supported yet. Another idea I had was to store the timestamp of the log entry as Cassandra’s timestamp and pass in artificial timestamps to remove (thrift API), but that seems hacky. Does anyone know if there is a good way to support this kind of rolling window of data efficiently? Thanks.
 
-Jeffrey