Posted to user@cassandra.apache.org by Sylvain Lebresne <sy...@yakaz.com> on 2010/01/12 17:45:25 UTC

Cassandra and TTL

Hello,

I have to deal with a lot of different data and Cassandra seems to be a good
fit for my needs so far. However, some of this data is volatile by nature and
for those, I would need to set something akin to a TTL. Those TTLs could be
long, but keeping that data forever would be useless.

I could deal with that by hand, writing some daemon that runs regularly and
removes what should be removed. However, this is neither particularly efficient
nor convenient, and I would find it really cool to be able to provide a TTL
when inserting something and not have to care more than that.

Which leads me to my question: why doesn't Cassandra allow setting a TTL for
data? Is it for a technical reason? A philosophical reason? Or has nobody
needed it enough to write it?

From what I understand of how Cassandra works, it seems to me that it could be
done pretty efficiently (even though I agree that it wouldn't be a minor
change). That is, it would require adding a ttl to columns (and/or rows). When
reading a column whose timestamp + ttl has expired, Cassandra would ignore it
(as it does for tombstoned columns). Then, during compaction, expired columns
would be collected.
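
To illustrate what I mean, here is a very rough sketch in Java (the names are
made up, this is not actual Cassandra code, and I'm assuming the timestamp and
the ttl use the same unit):

// Rough sketch only, not Cassandra code: a column carrying an optional ttl,
// checked the same way a tombstone would be, and dropped at compaction.
import java.util.ArrayList;
import java.util.List;

public final class TtlColumnSketch {
    private final byte[] name;
    private final byte[] value;
    private final long timestamp;  // whatever unit the client uses
    private final long ttl;        // same unit as timestamp; 0 means "never expires"

    public TtlColumnSketch(byte[] name, byte[] value, long timestamp, long ttl) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
        this.ttl = ttl;
    }

    // On read: an expired column is simply ignored, as a tombstoned one is.
    public boolean isExpired(long now) {
        return ttl > 0 && timestamp + ttl <= now;
    }

    // At compaction: expired columns are not copied into the new sstable.
    public static List<TtlColumnSketch> compact(List<TtlColumnSketch> columns, long now) {
        List<TtlColumnSketch> live = new ArrayList<TtlColumnSketch>();
        for (TtlColumnSketch c : columns) {
            if (!c.isExpired(now)) {
                live.add(c);
            }
        }
        return live;
    }
}

The real work would obviously be in the serialization format and in the
compaction code, but that's the whole idea as far as reads are concerned.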

Are there any major difficulties/obstacles I don't see?
Or maybe is there some trick I don't know about that already allows doing such
a thing?

And if not, would that be something that would interest the Cassandra
community? Or does nobody ever need such a thing? (I personally believe it to
be a desirable feature, but maybe I am the only one.)

Thanks,
Sylvain

Re: Cassandra and TTL

Posted by Sylvain Lebresne <sy...@yakaz.com>.
> I would think you could consider parallels between why Oracle or BDBs
> do not have TTLs.

Actually, I think it could be useful sometimes in Oracle or BDBs too. :)
(And just to clarify, I certainly do not advocate a mandatory TTL, just that it
could be useful.)
To add to that, I don't see how TTLs could be implemented efficiently in Oracle
and such (which may be why those databases don't bother), as there is no "good"
moment for collecting expired data (it's not like I'm an expert in DB
implementation, so please feel free to correct me if I'm talking nonsense
here).
But Cassandra already has the compaction phase, which seems well suited for
that.

> My personal opinion is that this is because TTLs imply temporary
> (transient non-persistent) storing of things, and this is quite
> different from core functionality that Cassandra offers.
> Rather, memcached, or other caching systems would be more obvious choice.

I agree and I don't. I, at least, could have a use for a long-lived (say 6
months or more) and potentially big cache, which memcached and the like are not
really made for, and for which Cassandra seems like a particularly good fit, if
only for the lack of those TTLs.

--
Sylvain

>
> -+ Tatu +-
>

Re: Cassandra and TTL

Posted by Tatu Saloranta <ts...@gmail.com>.
On Tue, Jan 12, 2010 at 8:45 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
> Hello,
>
...
> Which leads me to my question: why Cassandra doesn't allow to set a TTL for
> data ? Is it for technical reason ? For philosophical reason ? Or just nobody
> had needed it sufficiently to write it ?

I would think you could consider parallels between why Oracle or BDBs
do not have TTLs.

My personal opinion is that this is because TTLs imply temporary
(transient non-persistent) storing of things, and this is quite
different from core functionality that Cassandra offers.
Rather, memcached or other caching systems would be a more obvious choice.

-+ Tatu +-

Re: Cassandra and TTL

Posted by Kelvin Kakugawa <ka...@gmail.com>.
You're right: if the TTL is set dynamically, then we'd need to make room for
it. Otherwise, if it's set globally, we could save that space.

-Kelvin

On Wed, Jan 13, 2010 at 1:16 PM, Kelvin Kakugawa <ka...@gmail.com> wrote:
> Are you thinking about storing the expiration time explicitly?  Or,
> would it be reasonable to calculate it dynamically?
>
> -Kelvin
>
> On Wed, Jan 13, 2010 at 1:01 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> I think that is more or less what Sylvain is proposing.  The main
>> downside is adding the extra 8 bytes for a long (or 4 for an int,
>> which should actually be plenty of resolution for this use case) to
>> each Column object.
>>
>> On Wed, Jan 13, 2010 at 4:57 PM, Kelvin Kakugawa <ka...@gmail.com> wrote:
>>> An alternative implementation that may be worth exploring would be to
>>> modify IColumn's isMarkedForDelete() method to check TTL.
>>>
>>> It probably wouldn't be as performant as straight dropping SSTables.
>>> You'd probably also need to periodically compact old tables to remove
>>> expired rows.  However, on the surface, it appears to be a more
>>> seamless and fine-grained approach to this problem.
>>>
>>> -Kelvin
>>>
>>> A little more background:
>>> db.IColumn is the shared interface that db.Column and db.SuperColumn
>>> implement.  db.Column's isMarkedForDelete() method only checks if a
>>> flag has been set, right now.  So, it would be relatively
>>> straightforward to slip some logic into that method to check if its
>>> timestamp has expired beyond some TTL.
>>>
>>> However, I suspect that there may be other methods that may need to be
>>> slightly modified, as well.  And, the compaction code would have to be
>>> inspected to make sure that old tables are periodically compacted to
>>> remove expired rows.
>>>
>>> On Wed, Jan 13, 2010 at 12:30 PM, Mark Robson <ma...@gmail.com> wrote:
>>>> I also agree: Some mechanism to expire rolling data would be really good if
>>>> we can incorporate it. Using the existing client interface, deleting old
>>>> data is very cumbersome.
>>>>
>>>> We want to store lots of audit data in Cassandra, this will need to be
>>>> expired eventually.
>>>>
>>>> Nodes should be able to do expiry locally without needing to talk to other
>>>> nodes in the cluster. As we have a timestamp on everything anyway, can we
>>>> not use that somehow?
>>>>
>>>> If we only ever append data rather than update it (or update it very
>>>> rarely), can we somehow store timestamp ranges in each sstable file and then
>>>> have the server know when it's time to expire one?
>>>>
>>>> I'm guessing here from my limited understanding of how Cassandra works.
>>>>
>>>> Mark
>>>>
>>>
>>
>

Re: Cassandra and TTL

Posted by Ryan Daum <ry...@thimbleware.com>.
On Wed, Jan 13, 2010 at 6:19 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> If he needs column-level granularity then I don't see any other option.
>
> If he needs CF-level granularity then truncate will work fine. :)
>

Are you saying the proposed truncate functionality will support
'truncate all keys with timestamp < X'?

R

Re: Cassandra and TTL

Posted by Jonathan Ellis <jb...@gmail.com>.
If he needs column-level granularity then I don't see any other option.

If he needs CF-level granularity then truncate will work fine. :)

On Wed, Jan 13, 2010 at 5:16 PM, Kelvin Kakugawa <ka...@gmail.com> wrote:
> Are you thinking about storing the expiration time explicitly?  Or,
> would it be reasonable to calculate it dynamically?
>
> -Kelvin
>
> On Wed, Jan 13, 2010 at 1:01 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> I think that is more or less what Sylvain is proposing.  The main
>> downside is adding the extra 8 bytes for a long (or 4 for an int,
>> which should actually be plenty of resolution for this use case) to
>> each Column object.
>>
>> On Wed, Jan 13, 2010 at 4:57 PM, Kelvin Kakugawa <ka...@gmail.com> wrote:
>>> An alternative implementation that may be worth exploring would be to
>>> modify IColumn's isMarkedForDelete() method to check TTL.
>>>
>>> It probably wouldn't be as performant as straight dropping SSTables.
>>> You'd probably also need to periodically compact old tables to remove
>>> expired rows.  However, on the surface, it appears to be a more
>>> seamless and fine-grained approach to this problem.
>>>
>>> -Kelvin
>>>
>>> A little more background:
>>> db.IColumn is the shared interface that db.Column and db.SuperColumn
>>> implement.  db.Column's isMarkedForDelete() method only checks if a
>>> flag has been set, right now.  So, it would be relatively
>>> straightforward to slip some logic into that method to check if its
>>> timestamp has expired beyond some TTL.
>>>
>>> However, I suspect that there may be other methods that may need to be
>>> slightly modified, as well.  And, the compaction code would have to be
>>> inspected to make sure that old tables are periodically compacted to
>>> remove expired rows.
>>>
>>> On Wed, Jan 13, 2010 at 12:30 PM, Mark Robson <ma...@gmail.com> wrote:
>>>> I also agree: Some mechanism to expire rolling data would be really good if
>>>> we can incorporate it. Using the existing client interface, deleting old
>>>> data is very cumbersome.
>>>>
>>>> We want to store lots of audit data in Cassandra, this will need to be
>>>> expired eventually.
>>>>
>>>> Nodes should be able to do expiry locally without needing to talk to other
>>>> nodes in the cluster. As we have a timestamp on everything anyway, can we
>>>> not use that somehow?
>>>>
>>>> If we only ever append data rather than update it (or update it very
>>>> rarely), can we somehow store timestamp ranges in each sstable file and then
>>>> have the server know when it's time to expire one?
>>>>
>>>> I'm guessing here from my limited understanding of how Cassandra works.
>>>>
>>>> Mark
>>>>
>>>
>>
>

Re: Cassandra and TTL

Posted by Kelvin Kakugawa <ka...@gmail.com>.
Are you thinking about storing the expiration time explicitly?  Or,
would it be reasonable to calculate it dynamically?

-Kelvin

On Wed, Jan 13, 2010 at 1:01 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> I think that is more or less what Sylvain is proposing.  The main
> downside is adding the extra 8 bytes for a long (or 4 for an int,
> which should actually be plenty of resolution for this use case) to
> each Column object.
>
> On Wed, Jan 13, 2010 at 4:57 PM, Kelvin Kakugawa <ka...@gmail.com> wrote:
>> An alternative implementation that may be worth exploring would be to
>> modify IColumn's isMarkedForDelete() method to check TTL.
>>
>> It probably wouldn't be as performant as straight dropping SSTables.
>> You'd probably also need to periodically compact old tables to remove
>> expired rows.  However, on the surface, it appears to be a more
>> seamless and fine-grained approach to this problem.
>>
>> -Kelvin
>>
>> A little more background:
>> db.IColumn is the shared interface that db.Column and db.SuperColumn
>> implement.  db.Column's isMarkedForDelete() method only checks if a
>> flag has been set, right now.  So, it would be relatively
>> straightforward to slip some logic into that method to check if its
>> timestamp has expired beyond some TTL.
>>
>> However, I suspect that there may be other methods that may need to be
>> slightly modified, as well.  And, the compaction code would have to be
>> inspected to make sure that old tables are periodically compacted to
>> remove expired rows.
>>
>> On Wed, Jan 13, 2010 at 12:30 PM, Mark Robson <ma...@gmail.com> wrote:
>>> I also agree: Some mechanism to expire rolling data would be really good if
>>> we can incorporate it. Using the existing client interface, deleting old
>>> data is very cumbersome.
>>>
>>> We want to store lots of audit data in Cassandra, this will need to be
>>> expired eventually.
>>>
>>> Nodes should be able to do expiry locally without needing to talk to other
>>> nodes in the cluster. As we have a timestamp on everything anyway, can we
>>> not use that somehow?
>>>
>>> If we only ever append data rather than update it (or update it very
>>> rarely), can we somehow store timestamp ranges in each sstable file and then
>>> have the server know when it's time to expire one?
>>>
>>> I'm guessing here from my limited understanding of how Cassandra works.
>>>
>>> Mark
>>>
>>
>

Re: Cassandra and TTL

Posted by Jonathan Ellis <jb...@gmail.com>.
I think that is more or less what Sylvain is proposing.  The main
downside is adding the extra 8 bytes for a long (or 4 for an int,
which should actually be plenty of resolution for this use case) to
each Column object.
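
To make the cost concrete, roughly this (illustrative only, not the actual
Column serializer; the field order and names are made up):

// Illustrative only -- not the real serializer. The point is just that an
// expiration stored as an int (e.g. seconds) costs 4 extra bytes per column,
// versus 8 if we store it as a long.
import java.io.DataOutput;
import java.io.IOException;

final class ColumnSerializationSketch {
    static void serialize(byte[] name, byte[] value, long timestamp,
                          int expiration, DataOutput out) throws IOException {
        out.writeShort(name.length);
        out.write(name);
        out.writeLong(timestamp);   // existing 8-byte timestamp
        out.writeInt(expiration);   // the proposed extra 4 bytes
        out.writeInt(value.length);
        out.write(value);
    }
}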

On Wed, Jan 13, 2010 at 4:57 PM, Kelvin Kakugawa <ka...@gmail.com> wrote:
> An alternative implementation that may be worth exploring would be to
> modify IColumn's isMarkedForDelete() method to check TTL.
>
> It probably wouldn't be as performant as straight dropping SSTables.
> You'd probably also need to periodically compact old tables to remove
> expired rows.  However, on the surface, it appears to be a more
> seamless and fine-grained approach to this problem.
>
> -Kelvin
>
> A little more background:
> db.IColumn is the shared interface that db.Column and db.SuperColumn
> implement.  db.Column's isMarkedForDelete() method only checks if a
> flag has been set, right now.  So, it would be relatively
> straightforward to slip some logic into that method to check if its
> timestamp has expired beyond some TTL.
>
> However, I suspect that there may be other methods that may need to be
> slightly modified, as well.  And, the compaction code would have to be
> inspected to make sure that old tables are periodically compacted to
> remove expired rows.
>
> On Wed, Jan 13, 2010 at 12:30 PM, Mark Robson <ma...@gmail.com> wrote:
>> I also agree: Some mechanism to expire rolling data would be really good if
>> we can incorporate it. Using the existing client interface, deleting old
>> data is very cumbersome.
>>
>> We want to store lots of audit data in Cassandra, this will need to be
>> expired eventually.
>>
>> Nodes should be able to do expiry locally without needing to talk to other
>> nodes in the cluster. As we have a timestamp on everything anyway, can we
>> not use that somehow?
>>
>> If we only ever append data rather than update it (or update it very
>> rarely), can we somehow store timestamp ranges in each sstable file and then
>> have the server know when it's time to expire one?
>>
>> I'm guessing here from my limited understanding of how Cassandra works.
>>
>> Mark
>>
>

Re: Cassandra and TTL

Posted by Kelvin Kakugawa <ka...@gmail.com>.
An alternative implementation that may be worth exploring would be to
modify IColumn's isMarkedForDelete() method to check TTL.

It probably wouldn't be as performant as straight dropping SSTables.
You'd probably also need to periodically compact old tables to remove
expired rows.  However, on the surface, it appears to be a more
seamless and fine-grained approach to this problem.

-Kelvin

A little more background:
db.IColumn is the shared interface that db.Column and db.SuperColumn
implement.  db.Column's isMarkedForDelete() method only checks if a
flag has been set, right now.  So, it would be relatively
straightforward to slip some logic into that method to check if its
timestamp has expired beyond some TTL.

However, I suspect that there may be other methods that may need to be
slightly modified, as well.  And, the compaction code would have to be
inspected to make sure that old tables are periodically compacted to
remove expired rows.
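
To be concrete, the change I'm picturing is roughly the following (a standalone
sketch, not a patch against db.Column; the ttl field and its unit are made up):

// Standalone sketch of the idea -- not the real db.Column.
final class ColumnSketch {
    private final long timestamp;
    private final boolean deleteFlag; // what isMarkedForDelete() checks today
    private final long ttl;           // hypothetical per-column TTL, 0 = none

    ColumnSketch(long timestamp, boolean deleteFlag, long ttl) {
        this.timestamp = timestamp;
        this.deleteFlag = deleteFlag;
        this.ttl = ttl;
    }

    // Today this would just return deleteFlag. With a TTL, an expired column
    // is also treated as deleted, so every read path that already honors
    // tombstones would ignore it with no further changes.
    boolean isMarkedForDelete(long now) {
        return deleteFlag || (ttl > 0 && timestamp + ttl <= now);
    }
}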

On Wed, Jan 13, 2010 at 12:30 PM, Mark Robson <ma...@gmail.com> wrote:
> I also agree: Some mechanism to expire rolling data would be really good if
> we can incorporate it. Using the existing client interface, deleting old
> data is very cumbersome.
>
> We want to store lots of audit data in Cassandra, this will need to be
> expired eventually.
>
> Nodes should be able to do expiry locally without needing to talk to other
> nodes in the cluster. As we have a timestamp on everything anyway, can we
> not use that somehow?
>
> If we only ever append data rather than update it (or update it very
> rarely), can we somehow store timestamp ranges in each sstable file and then
> have the server know when it's time to expire one?
>
> I'm guessing here from my limited understanding of how Cassandra works.
>
> Mark
>

Re: Cassandra and TTL

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Jan 14, 2010 at 6:04 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
> Created, #699 (I have yet to figure out how to assign it to me though).

Added you as a "contributor" so you can do this.

Re: Cassandra and TTL

Posted by Sylvain Lebresne <sy...@yakaz.com>.
On Thu, Jan 14, 2010 at 12:15 PM, Kelvin Kakugawa <ka...@gmail.com> wrote:
> If you're interested, why don't you create a new ticket and assign it
> to yourself.  I'd be happy to help you figure out which parts of the
> codebase need to be touched.

Created, #699 (I have yet to figure out how to assign it to me though).
Of course, any help/ideas are welcome.

> btw, I'm working on Issue #580 which will add versioning to Cassandra.
>  An aspect of this feature is that I am adding support for a new type
> of column that serializes different metadata to disk.  So, when my
> contribution gets pushed through, the strategy that I use would
> probably be able to support a new TTL column, as well.  So, you
> wouldn't have to penalize all of the currently existing column types.

Yes, I was also thinking of making a special kind of column, so that it
doesn't penalize non-TTLed columns. So I'm indeed interested in what you're
doing with #580. I'll have a look at what you have done already.
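
What I have in mind is roughly a separate column kind that only TTLed inserts
pay for, something like this (a sketch only, with invented names, and nothing
to do with what #580 actually contains):

// Sketch only: plain columns stay exactly as they are today, and only the
// new kind carries (and would serialize) the extra expiration field.
class PlainColumnSketch {
    final byte[] name;
    final byte[] value;
    final long timestamp;

    PlainColumnSketch(byte[] name, byte[] value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }

    boolean isLive(long now) {
        return true; // never expires, no extra bytes on disk
    }
}

class ExpiringColumnSketch extends PlainColumnSketch {
    final long expiration; // only this subclass stores it

    ExpiringColumnSketch(byte[] name, byte[] value, long timestamp, long expiration) {
        super(name, value, timestamp);
        this.expiration = expiration;
    }

    @Override
    boolean isLive(long now) {
        return now < expiration;
    }
}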

Thanks

--
Sylvain

>
> On Thu, Jan 14, 2010 at 12:17 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
>> On Thu, Jan 14, 2010 at 6:46 AM, August Zajonc <au...@augustz.com> wrote:
>>> I personally like this last option of expiring entire sstables. It seems
>>> significantly more efficient than scrubbing data. The granularity might be a
>>> bit high, but by columnfamily seems a reasonable trade-off in the short run
>>> for an easier solution.
>>>
>>> For apps that don't want to see the old data, during a read if the data had
>>> a timestamp older than the expire time on the ColumnFamily it could also be
>>> ignored, then when all in an sstable < x, truncate.
>>>
>>> Logs are a great example of this.
>>
>> I personally think that truncate and a potential TTL are two different
>> things, since as
>> you said, they have very different granularity. In particular, using
>> truncate basically
>> means that all the data in the CF you truncate have exactly the same
>> TTL. Plus,
>> you have to know if a data have a TTL (and which one if you want
>> different CF that you
>> truncate at different time) when you request it. I, for one, have many
>> use where it's not
>> really a desirable solution. Plus, truncate suppose that you
>> deactivate compaction for
>> you CF, which may or may not be affordable.
>> But sure, for some models, and logs are a great example indeed,
>> truncate will be the
>> perfect solution.
>>
>> Anyway, I'm starting to seriously consider giving this TTL idea a
>> shot. We'll see how
>> that goes.
>>
>> --
>> Sylvain
>>
>

Re: Cassandra and TTL

Posted by Kelvin Kakugawa <ka...@gmail.com>.
Sylvain,

If you're interested, why don't you create a new ticket and assign it
to yourself.  I'd be happy to help you figure out which parts of the
codebase need to be touched.

-Kelvin

btw, I'm working on Issue #580 which will add versioning to Cassandra.
 An aspect of this feature is that I am adding support for a new type
of column that serializes different metadata to disk.  So, when my
contribution gets pushed through, the strategy that I use would
probably be able to support a new TTL column, as well.  So, you
wouldn't have to penalize all of the currently existing column types.

On Thu, Jan 14, 2010 at 12:17 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
> On Thu, Jan 14, 2010 at 6:46 AM, August Zajonc <au...@augustz.com> wrote:
>> I personally like this last option of expiring entire sstables. It seems
>> significantly more efficient than scrubbing data. The granularity might be a
>> bit high, but by columnfamily seems a reasonable trade-off in the short run
>> for an easier solution.
>>
>> For apps that don't want to see the old data, during a read if the data had
>> a timestamp older than the expire time on the ColumnFamily it could also be
>> ignored, then when all in an sstable < x, truncate.
>>
>> Logs are a great example of this.
>
> I personally think that truncate and a potential TTL are two different
> things, since as
> you said, they have very different granularity. In particular, using
> truncate basically
> means that all the data in the CF you truncate have exactly the same
> TTL. Plus,
> you have to know if a data have a TTL (and which one if you want
> different CF that you
> truncate at different time) when you request it. I, for one, have many
> use where it's not
> really a desirable solution. Plus, truncate suppose that you
> deactivate compaction for
> you CF, which may or may not be affordable.
> But sure, for some models, and logs are a great example indeed,
> truncate will be the
> perfect solution.
>
> Anyway, I'm starting to seriously consider giving this TTL idea a
> shot. We'll see how
> that goes.
>
> --
> Sylvain
>

Re: Cassandra and TTL

Posted by Sylvain Lebresne <sy...@yakaz.com>.
On Thu, Jan 14, 2010 at 6:46 AM, August Zajonc <au...@augustz.com> wrote:
> I personally like this last option of expiring entire sstables. It seems
> significantly more efficient than scrubbing data. The granularity might be a
> bit high, but by columnfamily seems a reasonable trade-off in the short run
> for an easier solution.
>
> For apps that don't want to see the old data, during a read if the data had
> a timestamp older than the expire time on the ColumnFamily it could also be
> ignored, then when all in an sstable < x, truncate.
>
> Logs are a great example of this.

I personally think that truncate and a potential TTL are two different things,
since, as you said, they have very different granularity. In particular, using
truncate basically means that all the data in the CF you truncate has exactly
the same TTL. Plus, you have to know whether a piece of data has a TTL (and
which one, if you want different CFs that you truncate at different times) when
you request it. I, for one, have many uses where it's not really a desirable
solution. Plus, truncate supposes that you deactivate compaction for your CF,
which may or may not be affordable.
But sure, for some models, and logs are a great example indeed, truncate will
be the perfect solution.

Anyway, I'm starting to seriously consider giving this TTL idea a shot. We'll
see how that goes.

--
Sylvain

Re: Cassandra and TTL

Posted by August Zajonc <au...@augustz.com>.
On Wed, Jan 13, 2010 at 2:30 PM, Mark Robson <ma...@gmail.com> wrote:

> I also agree: Some mechanism to expire rolling data would be really good if
> we can incorporate it. Using the existing client interface, deleting old
> data is very cumbersome.
>
> We want to store lots of audit data in Cassandra, this will need to be
> expired eventually.
>
> Nodes should be able to do expiry locally without needing to talk to other
> nodes in the cluster. As we have a timestamp on everything anyway, can we
> not use that somehow?
>
> If we only ever append data rather than update it (or update it very
> rarely), can we somehow store timestamp ranges in each sstable file and then
> have the server know when it's time to expire one?
>
I personally like this last option of expiring entire sstables. It seems
significantly more efficient than scrubbing data. The granularity might be a
bit high, but by columnfamily seems a reasonable trade-off in the short run
for an easier solution.

For apps that don't want to see the old data, during a read if the data had
a timestamp older than the expire time on the ColumnFamily it could also be
ignored, then when all in an sstable < x, truncate.

Logs are a great example of this.
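
To be concrete about the read-time part, something like this is what I mean
(just a sketch, the names and the per-CF setting are invented):

// Sketch with invented names: a per-ColumnFamily expiry window applied at
// read time. Anything older than the window is simply not returned.
final class CfReadExpirySketch {
    private final long expireAfter; // configured per ColumnFamily, e.g. in ms

    CfReadExpirySketch(long expireAfter) {
        this.expireAfter = expireAfter;
    }

    // A column is visible on read only while it is younger than the window;
    // once every column in an sstable fails this test, the whole file can go.
    boolean visibleOnRead(long columnTimestamp, long now) {
        return columnTimestamp > now - expireAfter;
    }
}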

- August

Re: Cassandra and TTL

Posted by Mark Robson <ma...@gmail.com>.
I also agree: Some mechanism to expire rolling data would be really good if
we can incorporate it. Using the existing client interface, deleting old
data is very cumbersome.

We want to store lots of audit data in Cassandra, this will need to be
expired eventually.

Nodes should be able to do expiry locally without needing to talk to other
nodes in the cluster. As we have a timestamp on everything anyway, can we
not use that somehow?

If we only ever append data rather than update it (or update it very
rarely), can we somehow store timestamp ranges in each sstable file and then
have the server know when it's time to expire one?

I'm guessing here from my limited understanding of how Cassandra works.
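
Something along these lines is what I'm picturing, purely hypothetical (the
metadata and the names are invented, I don't know what the sstable format
actually allows):

// Purely hypothetical: track the timestamp range covered by each sstable and
// let the node drop the whole file once everything in it is past its expiry.
final class SSTableExpirySketch {
    final String filename;
    final long minTimestamp; // oldest column timestamp written to this file
    final long maxTimestamp; // newest column timestamp written to this file

    SSTableExpirySketch(String filename, long minTimestamp, long maxTimestamp) {
        this.filename = filename;
        this.minTimestamp = minTimestamp;
        this.maxTimestamp = maxTimestamp;
    }

    // If even the newest data in the file is older than the cutoff, the node
    // can expire the file locally, without talking to any other node.
    boolean canExpire(long cutoffTimestamp) {
        return maxTimestamp < cutoffTimestamp;
    }
}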

Mark

Re: Cassandra and TTL

Posted by Tatu Saloranta <ts...@gmail.com>.
On Wed, Jan 13, 2010 at 6:18 AM, Ryan Daum <ry...@thimbleware.com> wrote:
> Just to speak up here, I think it's a more common use-case than you're
> imagining, even if maybe there's no reasonable way of implementing it.
> I for one have plenty of use for a TTL on a key, though in my case the TTL
> would be in days/weeks.

I misunderstood the original question -- I think that the use case of
long-term reaping of obsolete entries is much more relevant than
short-term cache expiration. So my comments were mostly off the mark.
This is indeed often done with Oracle DBs too, with rolling
weekly/monthly partitions and other constructs.

So I actually agree that mechanisms for supporting this would be useful, now
that I understand the request. :-)

-+ Tatu +-

Re: Cassandra and TTL

Posted by Ryan Daum <ry...@thimbleware.com>.
Just to speak up here, I think it's a more common use-case than you're
imagining, even if maybe there's no reasonable way of implementing it.

I for one have plenty of use for a TTL on a key, though in my case the TTL
would be in days/weeks.

Alternatively, I know it's considered "wrong", but having a way of getting
all unique keys + timestamps from a RandomPartitioner would allow me to do
manual scavenging of my own. sstable2json is perhaps not appropriate because
it includes replicated data.

On Tue, Jan 12, 2010 at 11:56 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> I'm skeptical that this is a common use-case...  If truncating old
> sstables entirely
> (https://issues.apache.org/jira/browse/CASSANDRA-531) meets your
> needs, that is going to be less work and more performant.
>
> -Jonathan
>
> On Tue, Jan 12, 2010 at 10:45 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
> > Hello,
> >
> > I have to deal with a lot of different data and Cassandra seems to be a good
> > fit for my needs so far. However, some of this data is volatile by nature and
> > for those, I would need to set something akin to a TTL. Those TTL could be
> > long, but keeping those data forever would be useless.
> >
> > I could deal with that by hand, writing some daemon that run regularly and
> > remove what should be removed. However this is not particularly efficient, nor
> > convenient, and I would find it really cool to be able to provide a TTL when
> > inserting something and don't have to care more than that.
> >
> > Which leads me to my question: why Cassandra doesn't allow to set a TTL for
> > data ? Is it for technical reason ? For philosophical reason ? Or just nobody
> > had needed it sufficiently to write it ?
> >
> > From what I understand of how Cassandra works, it seems to me that it
> > could be done pretty efficiently (even though I agree that it wouldn't
> > be a minor change). That is, it would require to add a ttl to column
> > (and/or row). When reading a column whose timestamp + ttl is expired, it
> > would ignore it (as for tombstoned column). Then during compaction, expired
> > column would be collected.
> >
> > Is there any major difficulties/obstacles I don't see ?
> > Or maybe is there some trick I don't know about that allow to do such a
> > thing already ?
> >
> > And if not, would that be something that would interest the Cassandra
> > community ? Or does nobody ever need such a thing ? (I personally believe
> > it to be a desirable feature, but maybe I am the only one.)
> >
> > Thanks,
> > Sylvain
> >
>

Re: Cassandra and TTL

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Jan 12, 2010 at 3:43 PM, Sylvain Lebresne <sy...@yakaz.com> wrote:
>> Right, that is why the ticket says you would want to disable
>> compaction if you want to truncate less than the whole CF.
>
> Indeed, make sense now. But disabling compaction would have
> an impact both on the data size and on the read performance, right ?
> I suppose that it would mostly be viable if you truncate often enough.
> Do we have an idea of the kind of impact disabling compaction would
> have ?

Depends on how many overwrites you are doing.  If each row is only
present in one or two sstables it is not so bad, but if it needs to
combine a few dozen row versions for each query, that will get
painful.

So yes, that suggestion is primarily intended for "I can keep the number of
sstables relatively small by truncating often and using a large memtable
size."

-Jonathan

Re: Cassandra and TTL

Posted by Sylvain Lebresne <sy...@yakaz.com>.
On Tue, Jan 12, 2010 at 10:29 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Tue, Jan 12, 2010 at 2:39 PM, Sylvain Lebresne <sy...@yakaz.com> wrote:
>>> If truncating old sstables entirely
>>> (https://issues.apache.org/jira/browse/CASSANDRA-531) meets your
>>> needs, that is going to be less work and more performant.
>>
>> Well, I'm not sure I understand completely this ticket. The part in the
>> comment saying "drop all sstables older than X" seems to be something helpful.
>> But aren't sstables regularly merged together, thus mixing "older" data with
>> newer data ?
>
> Right, that is why the ticket says you would want to disable
> compaction if you want to truncate less than the whole CF.

Indeed, that makes sense now. But disabling compaction would have an impact
both on the data size and on the read performance, right?
I suppose that it would mostly be viable if you truncate often enough.
Do we have an idea of the kind of impact disabling compaction would have?


>> That is, is this 'truncate with a timestamp t' always remove *all* columns
>> with a timestamp older than t ?
>
> Yes.

Thanks a lot for the clarification.

--
Sylvain

Re: Cassandra and TTL

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Jan 12, 2010 at 2:39 PM, Sylvain Lebresne <sy...@yakaz.com> wrote:
>> If truncating old sstables entirely
>> (https://issues.apache.org/jira/browse/CASSANDRA-531) meets your
>> needs, that is going to be less work and more performant.
>
> Well, I'm not sure I understand completely this ticket. The part in the
> comment saying "drop all sstables older than X" seems to be something helpful.
> But aren't sstables regularly merged together, thus mixing "older" data with
> newer data ?

Right, that is why the ticket says you would want to disable
compaction if you want to truncate less than the whole CF.

> That is, is this 'truncate with a timestamp t' always remove *all* columns
> with a timestamp older than t ?

Yes.

-Jonathan

Re: Cassandra and TTL

Posted by Sylvain Lebresne <sy...@yakaz.com>.
> I'm skeptical that this is a common use-case...

Fair enough. The idea is a sort of long-lived cache, as I want to store (a big
volume of) crawled documents. But I can't (and don't want to) keep those
documents forever, hence this TTL idea. Agreed, this is a rather specific
use-case, but I can imagine other kinds of data that you may not want to keep
forever (news articles, old data (say a tweet) that hasn't been used in a long
time, ...). Having to track such "outdated" data manually is just no fun. But
maybe it's not that common...

> If truncating old sstables entirely
> (https://issues.apache.org/jira/browse/CASSANDRA-531) meets your
> needs, that is going to be less work and more performant.

Well, I'm not sure I completely understand this ticket. The part in the
comment saying "drop all sstables older than X" seems helpful. But aren't
sstables regularly merged together, thus mixing "older" data with newer data?
That is, does this 'truncate with a timestamp t' always remove *all* columns
with a timestamp older than t?

Thanks
--
Sylvain


> On Tue, Jan 12, 2010 at 10:45 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
>> Hello,
>>
>> I have to deal with a lot of different data and Cassandra seems to be a good
>> fit for my needs so far. However, some of this data is volatile by nature and
>> for those, I would need to set something akin to a TTL. Those TTL could be
>> long, but keeping those data forever would be useless.
>>
>> I could deal with that by hand, writing some daemon that run regularly and
>> remove what should be removed. However this is not particularly efficient, nor
>> convenient, and I would find it really cool to be able to provide a TTL when
>> inserting something and don't have to care more than that.
>>
>> Which leads me to my question: why Cassandra doesn't allow to set a TTL for
>> data ? Is it for technical reason ? For philosophical reason ? Or just nobody
>> had needed it sufficiently to write it ?
>>
>> From what I understand of how Cassandra works, it seems to me that it
>> could be done pretty efficiently (even though I agree that it wouldn't
>> be a minor
>> change). That is, it would require to add a ttl to column (and/or row). When
>> reading a column whose timestamp + ttl is expired, it would ignore it (as for
>> tombstoned column). Then during compaction, expired column would be
>> collected.
>>
>> Is there any major difficulties/obstacles I don't see ?
>> Or maybe is there some trick I don't know about that allow to do such a thing
>> already ?
>>
>> And if not, would that be something that would interest the Cassandra
>> community ? Or does nobody ever need such a thing ? (I personally believe it
>> to be a desirable feature, but maybe I am the only one.)
>>
>> Thanks,
>> Sylvain
>>
>

Re: Cassandra and TTL

Posted by Jonathan Ellis <jb...@gmail.com>.
I'm skeptical that this is a common use-case...  If truncating old
sstables entirely
(https://issues.apache.org/jira/browse/CASSANDRA-531) meets your
needs, that is going to be less work and more performant.

-Jonathan

On Tue, Jan 12, 2010 at 10:45 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
> Hello,
>
> I have to deal with a lot of different data and Cassandra seems to be a good
> fit for my needs so far. However, some of this data is volatile by nature and
> for those, I would need to set something akin to a TTL. Those TTL could be
> long, but keeping those data forever would be useless.
>
> I could deal with that by hand, writing some daemon that run regularly and
> remove what should be removed. However this is not particularly efficient, nor
> convenient, and I would find it really cool to be able to provide a TTL when
> inserting something and don't have to care more than that.
>
> Which leads me to my question: why Cassandra doesn't allow to set a TTL for
> data ? Is it for technical reason ? For philosophical reason ? Or just nobody
> had needed it sufficiently to write it ?
>
> From what I understand of how Cassandra works, it seems to me that it
> could be done pretty efficiently (even though I agree that it wouldn't
> be a minor
> change). That is, it would require to add a ttl to column (and/or row). When
> reading a column whose timestamp + ttl is expired, it would ignore it (as for
> tombstoned column). Then during compaction, expired column would be
> collected.
>
> Is there any major difficulties/obstacles I don't see ?
> Or maybe is there some trick I don't know about that allow to do such a thing
> already ?
>
> And if not, would that be something that would interest the Cassandra
> community ? Or does nobody ever need such a thing ? (I personally believe it
> to be a desirable feature, but maybe I am the only one.)
>
> Thanks,
> Sylvain
>