Posted to user@cassandra.apache.org by Sylvain Lebresne <sy...@yakaz.com> on 2010/04/01 09:45:47 UTC

Re: expiring data out of Cassandra/time to live

> On that topic, what exactly is keeping this feature out of the official
> releases?

The patch changes the Thrift API. Among possibly other reasons, I think that is
why it wasn't even considered for inclusion in the 0.6 branch. As for trunk (and
thus the future 0.7), there are scheduled internal changes (vector clocks and
changes to the SSTable format, at least) that will force this patch to be
rewritten somehow. I think that is part of the reason why it is not yet
included. That being said, I'm of course all for its inclusion.

(As a side note, patches for the 0.6 version are now attached to the Jira
ticket. That should make it much easier for those who want to test than checking
out the old svn version and merging back to 0.6.)

>
> On Wed, Mar 31, 2010 at 3:43 PM, Daniel Kluesing <dk...@bluekai.com> wrote:
>>
>> We also applied this patch to the 0.6 branch and have been running it for
>> a bit over a week. Works well, would love to see it get into trunk/0.7
>> proper.
>>
>>
>>
>> From: Ryan Daum [mailto:ryan@thimbleware.com]
>> Sent: Wednesday, March 31, 2010 11:49 AM
>> To: user@cassandra.apache.org
>> Subject: Re: expiring data out of Cassandra/time to live
>>
>>
>>
>> I was able to successfully merge this patch into the 0.6 branch a few
>> weeks ago by doing the following:
>>
>>
>>
>> 1. Downloading the patch
>> 2. Checking out the trunk of Cassandra from github
>> 3. Rolling back (checking out) the git repo to the same date that the patch
>>    was submitted to Jira
>> 4. Applying the patch
>> 5. Committing to Git
>> 6. Merging forward to the 0.6 branch
>> 7. Resolving one or two minor conflicts
>>
>>
>>
>> R
>>
>>
>>
>> On Wed, Mar 31, 2010 at 2:46 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> Sounds like you want to follow
>> https://issues.apache.org/jira/browse/CASSANDRA-699.  There is a patch
>> there but I wouldn't recommend merging it if Java scares you. :)
>>
>> On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore
>> <mi...@googlemail.com> wrote:
>> > Hello everyone,
>> >
>> > I saw a thread on the incubator user chat that started a few months ago:
>> >
>> > http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02047.html
>> > . It looks like this is the new official user mailing list, so I'll add my
>> > thoughts/question here.
>> >
>> > Is there any way to set a TTL on data stored in Cassandra? Deleting old
>> > SSTables isn't enough for my needs. I need the data to go away after a fixed
>> > period of time. Here is what I'm trying to do and my reasoning why I think
>> > Cassandra and not something like Flare/Memcache meets my need:
>> >
>> > I'm building a reputation system. We get lots of data at my work (in the
>> > 10's of GB of reputation data a day). The trick is that old data is not
>> > useful, as a sender's IP address might have changed, they might have had a
>> > bot on their system and now have removed it, etc. So I need to be able to
>> > keep data for a fixed period of time, and afterwards it isn't needed and
>> > ideally would be GC'd out.
>> >
>> > We want to do one thing if we either never heard of the individual or at
>> > least not since the expiry time, and another thing based on the reputation
>> > data that is stored in Cassandra if it is current. So ideally a Cassandra
>> > call for a key for someone whose reputation is expired would return nothing
>> > and we'd reply with our default reputation for that individual. There really
>> > is no point using network bandwidth to return all the fields associated with
>> > that key only to look at a timestamp and end up ignoring it anyway.
>> > Similarly, the latency of requesting first the timestamp and then the data
>> > in two separate requests is prohibitive.
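As a rough illustration of the read behavior asked for above, here is a minimal
Python sketch. It is not the CASSANDRA-699 patch or any Cassandra API:
`ExpiringStore`, `reputation_for`, and `DEFAULT_REPUTATION` are hypothetical
names, and an in-memory dict stands in for the cluster. Expired columns are
filtered at read time, so the caller gets an empty row and falls back to the
default without a second round trip.

```python
import time

# Hypothetical default used when nothing current is stored for a sender.
DEFAULT_REPUTATION = {"score": 0}


class ExpiringStore:
    """Toy in-memory stand-in for a store with per-column TTL (an assumption)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._rows = {}  # key -> {column: (value, expires_at)}

    def insert(self, key, column, value, ttl_seconds):
        # Record the absolute expiry time alongside the value.
        expires_at = self._clock() + ttl_seconds
        self._rows.setdefault(key, {})[column] = (value, expires_at)

    def get_row(self, key):
        # Return only columns that have not expired; expired ones are hidden.
        now = self._clock()
        row = self._rows.get(key, {})
        return {col: val for col, (val, exp) in row.items() if exp > now}


def reputation_for(store, sender):
    # One read: an empty result means "never seen or expired".
    row = store.get_row(sender)
    return row if row else DEFAULT_REPUTATION
```

With this shape, a sender that was never stored and a sender whose data has
expired look the same to the caller, which is the behavior described above.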
>> >
>> > Why Cassandra:
>> >
>> > - Our data is complex and is hard to handle completely in a key/value
>> >   sense. In the past we were doing this and just encoding the complex
>> >   structure inside of JSON, but this isn't ideal. It is very nice
>> >   algorithmically to be able to say: give me this column, or update this
>> >   element of this hash, etc., rather than having to pull the old version,
>> >   decode, modify, re-encode and push back to a cache-based system.
>> > - Our data is large (in the low TB's at the moment, but expected to grow to
>> >   50-100TB of live data).
>> > - Need quick response for both searches and writes: typically for each
>> >   thing we track we get a request for the reputation, the message gets
>> >   processed, and then we get feedback back from the recipient. So reads and
>> >   writes are symmetric.
>> > - High request rate: millions per hour.
>> > - Hundreds of millions of unique reputations (this is why crawling through
>> >   the data with a script purging old data doesn't make sense).
>> > - Availability/load balancing is a must. Data needs to be replicated, and a
>> >   disk copy is useful so if we have a power outage we don't lose the system.
>> > - It would be interesting to keep a local subset of our data at customers'
>> >   sites and have them "replicate up" their data rather than send their
>> >   feedback in a different manner that then has to be processed and pumped
>> >   into our datastore (hopefully this is possible with Cassandra with some
>> >   creative choices of how the data is hashed between nodes).
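To make the first point in the list above concrete, here is a small Python
sketch contrasting the two update styles. Plain dicts stand in for both the
key/value cache and the column store, and `update_blob` and `update_column` are
hypothetical helper names, not part of any client library.

```python
import json


def update_blob(cache, key, field, value):
    """Key/value style: the whole record is one opaque JSON string."""
    record = json.loads(cache[key])   # pull the old version and decode
    record[field] = value             # modify one field
    cache[key] = json.dumps(record)   # re-encode and push everything back


def update_column(rows, key, column, value):
    """Column style: write only the column that changed."""
    rows.setdefault(key, {})[column] = value
```

The blob version moves and rewrites the full record for every change, while
the column version touches a single named value, which is the convenience the
list item describes.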
>> >
>> > Does the capability to set an expiry time exist? If not, are there any plans
>> > to add it? My Java experience is very limited (I'm accessing Cassandra via
>> > Thrift/Perl), so it isn't something I'd be able to jump in and run with
>> > myself.
>> >
>>
>>
>