Posted to user@cassandra.apache.org by Bill Au <bi...@gmail.com> on 2010/03/12 04:34:11 UTC

question about deleting from cassandra

Let's take Twitter as an example.  All the tweets are timestamped.  I want to
keep only a month's worth of tweets for each user.  The number of tweets
that fit within this one-month window varies from user to user.  What is the
best way to accomplish this?  There are millions of users.  Do I need to
loop through all of them and handle the deletes one user at a time?  Or is
there a better way to do this?  If a user has not posted a new tweet in more
than a month, I also want to remove the user itself.  Do I need to loop
through all the users one at a time for that as well?

Bill

Re: question about deleting from cassandra

Posted by Peter Chang <pe...@gmail.com>.
I've been thinking more about a similar sort of problem.

The major difference between normal relational databases and big hashtables
is that in the former you can sort and retrieve on any column. In big
hashtables (or at least in Cassandra), you only have one field to sort on
and the sort type is predetermined.

From a theoretical perspective, your traditional DBMS typically allows you
to create arbitrary indexes in order to speed up access. I'm thinking the
same idea can be applied to something like this.

Ergo, I imagine that for different kinds of entities, you can have a
separate supercolumn family that basically serves as an index table. From
what I've heard, this is a fairly common approach.

In a broader perspective, you can also use tables that serve as metadata.
Ergo, you could store the keys of all posts bucketed by some time period
(e.g. month).

Peter


On Thu, Mar 11, 2010 at 7:34 PM, Bill Au <bi...@gmail.com> wrote:

> Let take Twitter as an example.  All the tweets are timestamped.  I want to
> keep only a month's worth of tweets for each user.  The number of tweets
> that fit within this one month window varies from user to user.  What is the
> best way to accomplish this?  There are millions of users.  Do I need to
> loop through all of them and handle the delete one user at a time?  Or is
> there a better way to do this?  If a user has not post a new tweet in more
> than a month, I also want to remove the user itself.  Do I also need to do
> looking through all the users one at a time?
>
> Bill
>

Re: question about deleting from cassandra

Posted by Weijun Li <we...@gmail.com>.
The change to FBUtilities.java is quite simple (just add one method). You
can search for ExpiringColumn in the mailing list archive and find the
thread to which Sylvain attached 3 patches for branch 0.5.0. That's where I
started, and the patch worked successfully.
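For readers who haven't seen it, the behavior that patch adds — a column
carrying a time-to-live, filtered from reads once it expires — can be
sketched roughly like this (an illustrative model, not the actual patch
code):

```python
import time

class ExpiringColumn:
    """Toy model of a column with a TTL: reads treat it as deleted
    once the TTL has elapsed, and compaction can then purge it."""

    def __init__(self, name, value, ttl_secs):
        self.name = name
        self.value = value
        self.created = time.time()
        self.ttl = ttl_secs

    def is_live(self, now=None):
        # A read filters the column out once its TTL has elapsed.
        now = time.time() if now is None else now
        return now < self.created + self.ttl
```

With something like this in place, expiry needs no client-side delete loop
at all: columns written with a TTL simply age out.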

-Weijun

On Sun, Mar 14, 2010 at 6:29 AM, Ryan Daum <ry...@thimbleware.com> wrote:

> +1, I'd like to try this patch but am running into error: patch failed:
> src/java/org/apache/cassandra/utils/FBUtilities.java:342
>
> Alternatively, someone could create a github fork which incorporates this
> patch?
>
> Ryan
>
> On Sat, Mar 13, 2010 at 3:36 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> since they are separate changes, it's much easier to review if they
>> are submitted separately.
>>
>> On 3/13/10, Weijun Li <we...@gmail.com> wrote:
>> > Sure. I'm making another change for cross multiple DC replication, once
>> > this one is done (probably in next week) I'll submit them together to
>> > Jira. All based on 0.6 beta2.
>> >
>> > -Weijun
>> >
>> > -----Original Message-----
>> > From: Jonathan Ellis [mailto:jbellis@gmail.com]
>> > Sent: Saturday, March 13, 2010 5:36 AM
>> > To: cassandra-user@incubator.apache.org
>> > Subject: Re: question about deleting from cassandra
>> >
>> > You should submit your minor change to jira for others who might want
>> > to try it.
>> >
>> > On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <we...@gmail.com> wrote:
>> >> Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it worked
>> >> perfectly. Without this feature, if you have a high volume of new and
>> >> expired columns your life will be miserable :-)
>> >>
>> >> Thanks for great job Sylvain!!
>> >>
>> >> -Weijun
>> >>
>> >> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sy...@yakaz.com>
>> >> wrote:
>> >>>
>> >>> I guess you can also vote for this ticket :
>> >>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>> >>>
>> >>> </advertising>
>> >>>
>> >>> --
>> >>> Sylvain
>> >>>
>> >>>
>> >>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <ma...@gmail.com> wrote:
>> >>> > On 12 March 2010 03:34, Bill Au <bi...@gmail.com> wrote:
>> >>> >>
>> >>> >> Let take Twitter as an example.  All the tweets are timestamped.  I
>> >>> >> want to keep only a month's worth of tweets for each user.  The
>> >>> >> number of tweets that fit within this one month window varies from
>> >>> >> user to user.  What is the best way to accomplish this?
>> >>> >
>> >>> > This is the "expiry" problem that has been discussed on this list
>> >>> > before. As far as I can see there are no easy ways to do it with 0.5
>> >>> >
>> >>> > If you use the ordered partitioner and make the first part of the
>> >>> > keys a timestamp (or part of it) then you can get the keys and
>> >>> > delete them.
>> >>> >
>> >>> > However, these deletes will be quite inefficient, currently each row
>> >>> > must be deleted individually (there was a patch to range delete
>> >>> > kicking around, I don't know if it's accepted yet)
>> >>> >
>> >>> > But even if range delete is implemented, it's still quite inefficient
>> >>> > and not really what you want, and doesn't work with the
>> >>> > RandomPartitioner
>> >>> >
>> >>> > If you have some metadata to say who tweeted within a given period
>> >>> > (say 10 days or 30 days) and you store the tweets all in the same
>> >>> > key per user per period (say with one column per tweet, or use
>> >>> > supercolumns), then you can just delete one key per user per period.
>> >>> >
>> >>> > One of the problems with using a time-based key with ordered
>> >>> > partitioner is that you're always going to have a data imbalance, so
>> >>> > you may want to try hashing *part* of the key (the first part) so
>> >>> > you can still range scan the next part. This may fix load balancing
>> >>> > while still enabling you to use range scans to do data expiry.
>> >>> >
>> >>> > e.g. your key is
>> >>> >
>> >>> > Hash of day number + user id + timestamp
>> >>> >
>> >>> > Then you can range scan the entire day's tweets to expire them, and
>> >>> > range scan a given user's tweets for a given day efficiently (and
>> >>> > doing this for 30 days is just 30 range scans)
>> >>> >
>> >>> > Putting a hash in there fixes load balancing with OPP.
>> >>> >
>> >>> > Mark
>> >>> >
>> >>
>> >>
>> >
>> >
>>
>
>
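The bucketed key scheme quoted above ("hash of day number + user id +
timestamp") can be sketched in a few lines; the md5 hash, the 8-character
prefix, and the ':' separator are arbitrary illustrative choices:

```python
import hashlib

SECONDS_PER_DAY = 86400

def day_number(ts):
    # Days since the Unix epoch for a timestamp in seconds.
    return int(ts // SECONDS_PER_DAY)

def day_prefix(ts):
    # Hashing the day number spreads each day's rows around the token
    # ring, avoiding the hot spot a raw time prefix creates under OPP.
    return hashlib.md5(str(day_number(ts)).encode()).hexdigest()[:8]

def make_key(user_id, ts):
    # hash-of-day + user id + timestamp, as in the scheme above.
    return "%s:%s:%d" % (day_prefix(ts), user_id, int(ts))

def day_scan_bounds(ts):
    # Key range covering every tweet written on ts's day; ';' sorts
    # immediately after ':', so the range brackets the whole prefix.
    p = day_prefix(ts)
    return p + ":", p + ";"
```

Expiring a 30-day window is then 30 such range scans, one per day bucket.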

Re: question about deleting from cassandra

Posted by Ryan Daum <ry...@thimbleware.com>.
+1, I'd like to try this patch but am running into an error: patch failed:
src/java/org/apache/cassandra/utils/FBUtilities.java:342

Alternatively, could someone create a github fork that incorporates this
patch?

Ryan

On Sat, Mar 13, 2010 at 3:36 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> since they are separate changes, it's much easier to review if they
> are submitted separately.

Re: question about deleting from cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
since they are separate changes, it's much easier to review if they
are submitted separately.

On 3/13/10, Weijun Li <we...@gmail.com> wrote:
> Sure. I'm making another change for cross multiple DC replication, once this
> one is done (probably in next week) I'll submit them together to Jira. All
> based on 0.6 beta2.
>
> -Weijun

RE: question about deleting from cassandra

Posted by Weijun Li <we...@gmail.com>.
Sure. I'm making another change for replication across multiple DCs; once
this one is done (probably next week) I'll submit them together to Jira,
all based on 0.6 beta2.

-Weijun

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Saturday, March 13, 2010 5:36 AM
To: cassandra-user@incubator.apache.org
Subject: Re: question about deleting from cassandra

You should submit your minor change to jira for others who might want to try
it.



Re: question about deleting from cassandra

Posted by Tatu Saloranta <ts...@gmail.com>.
On Thu, Mar 18, 2010 at 7:31 AM, Vick Khera <vi...@khera.org> wrote:
> On Thu, Mar 18, 2010 at 9:15 AM, Bill Au <bi...@gmail.com> wrote:
>> In theory there is a breaking point somewhere, right?
>
> I don't think google has hit it yet, so I'd have to say nobody has
> reached "the breaking point" yet....
>
> What do the big places do when people quit the service?  Ie, if I
> close my facebook or twitter, does all that data just continue to sit
> there?

One typical solution would be to move data offline (backup to tapes, back
in the day): take a snapshot of the relevant data, add it to cheaper
storage, and remove it from primary storage. This is used for all kinds of
things: cleaning up inactive accounts, rolling old logs, backups. So it
does not always need to mean complete hard deletes, but maybe just moving
to secondary storage, from which the data has to be explicitly reinstated
if need be.

Most common, I assume, is just soft-deleting things though: adding a flag
indicating that the thing is not to be surfaced, even though it is still
stored. This can of course be combined with vacuuming stale stuff to
secondary storage. It would be nice to have a system that does this.
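A toy version of that soft-delete-plus-vacuum combination, with plain dicts
standing in for the primary and secondary stores; all names are made up for
illustration:

```python
import time

ARCHIVE = {}  # stand-in for cheap secondary storage (tapes, back in the day)

def soft_delete(record):
    # Flag the record rather than removing it; reads should skip it.
    record["deleted_at"] = time.time()

def vacuum(primary, older_than_secs):
    # Move records soft-deleted longer than older_than_secs ago out of
    # the primary store; they must be explicitly reinstated if needed.
    cutoff = time.time() - older_than_secs
    for key in list(primary):
        deleted_at = primary[key].get("deleted_at")
        if deleted_at is not None and deleted_at < cutoff:
            ARCHIVE[key] = primary.pop(key)
```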

-+ Tatu +-

Re: question about deleting from cassandra

Posted by Vick Khera <vi...@khera.org>.
On Thu, Mar 18, 2010 at 9:15 AM, Bill Au <bi...@gmail.com> wrote:
> In theory there is a breaking point somewhere, right?

I don't think google has hit it yet, so I'd have to say nobody has
reached "the breaking point" yet....

What do the big places do when people quit the service?  I.e., if I
close my Facebook or Twitter account, does all that data just continue to
sit there?

How does that translate to services people actually pay for, too?  I.e.,
if, say, SmugMug were using Cassandra to store images and someone quit,
they'd probably be expected to remove that content since the user is no
longer paying.

Re: question about deleting from cassandra

Posted by Bill Au <bi...@gmail.com>.
That is very true from the users' point of view, especially since their data
is being stored for free.  But I am looking at it from the service
provider's point of view.  Maybe that's why NoSQL solutions are so popular
right now: they scale much better than an RDBMS.  I wonder if service
providers just keep adding more and more machines as the number of users and
the amount of data grow.  In theory there is a breaking point somewhere, right?

Bill

On Wed, Mar 17, 2010 at 10:28 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> That's a strange assumption.  Users typically don't like their data
> being deleted without a very good reason.  "We didn't have enough
> room" is not a very good reason. :)

Re: question about deleting from cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
That's a strange assumption.  Users typically don't like their data
being deleted without a very good reason.  "We didn't have enough
room" is not a very good reason. :)

On Wed, Mar 17, 2010 at 9:03 PM, Bill Au <bi...@gmail.com> wrote:
> I would assume that Facebook and Twitter are not keep all the data that they
> store in Cassandra forever.  I wonder how are they deleting old data from
> Cassandra...
> Bill

Re: question about deleting from cassandra

Posted by Bill Au <bi...@gmail.com>.
I would assume that Facebook and Twitter are not keeping all the data that
they store in Cassandra forever.  I wonder how they are deleting old data
from Cassandra...

Bill

On Mon, Mar 15, 2010 at 1:01 PM, Weijun Li <we...@gmail.com> wrote:

> OK I will try to separate them out.

Re: question about deleting from cassandra

Posted by Sylvain Lebresne <sy...@yakaz.com>.
Hi,

I modified the patch to work against the current 0.6 svn branch (as I
needed it myself). I attached the files to jira if someone wants to play
with it. Maybe I should remove the old files, as they only worked
against an old svn trunk revision?

--
Sylvain

On Mon, Mar 15, 2010 at 6:01 PM, Weijun Li <we...@gmail.com> wrote:
> OK I will try to separate them out.

Re: question about deleting from cassandra

Posted by Weijun Li <we...@gmail.com>.
OK I will try to separate them out.

On Sat, Mar 13, 2010 at 5:35 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> You should submit your minor change to jira for others who might want to
> try it.

Re: question about deleting from cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
You should submit your minor change to jira for others who might want to try it.

On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <we...@gmail.com> wrote:
> Tried Sylvain's feature in 0.6 beta2 (need minor change) and it worked
> perfectly. Without this feature, as far as you have high volume new and
> expired columns your life will be miserable :-)
>
> Thanks for great job Sylvain!!
>
> -Weijun

Re: question about deleting from cassandra

Posted by Weijun Li <we...@gmail.com>.
Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it worked
perfectly. Without this feature, if you have a high volume of new and
expired columns your life will be miserable :-)

Thanks for the great job, Sylvain!!

-Weijun

On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sy...@yakaz.com>wrote:

> I guess you can also vote for this ticket :
> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>
> </advertising>
>
> --
> Sylvain

Re: question about deleting from cassandra

Posted by Sylvain Lebresne <sy...@yakaz.com>.
I guess you can also vote for this ticket :
https://issues.apache.org/jira/browse/CASSANDRA-699 :)

</advertising>

--
Sylvain


On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <ma...@gmail.com> wrote:

Re: question about deleting from cassandra

Posted by Mark Robson <ma...@gmail.com>.
On 12 March 2010 03:34, Bill Au <bi...@gmail.com> wrote:

> Let's take Twitter as an example.  All the tweets are timestamped.  I want to
> keep only a month's worth of tweets for each user.  The number of tweets
> that fit within this one month window varies from user to user.  What is the
> best way to accomplish this?


This is the "expiry" problem that has been discussed on this list before. As
far as I can see there are no easy ways to do it with 0.5.

If you use the ordered partitioner and make the first part of the keys a
timestamp (or part of it) then you can get the keys and delete them.

However, these deletes will be quite inefficient: currently each row must be
deleted individually (there was a patch for range delete kicking around; I
don't know if it has been accepted yet).
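In sketch form, that key-range expiry is one range scan plus one delete per key. The `client` object and its two methods below are hypothetical stand-ins for a Thrift client call, not a real Cassandra API:

```python
def expire_range(client, start_key, end_key, batch=100):
    """Scan keys in [start_key, end_key) and delete them one at a time.

    With the order-preserving partitioner and timestamp-prefixed keys,
    everything older than the cutoff sits in one contiguous key range.
    """
    deleted = 0
    cursor = start_key
    while True:
        # hypothetical call: up to `batch` sorted keys in [cursor, end_key)
        keys = client.get_key_range(cursor, end_key, batch)
        if not keys:
            return deleted
        for key in keys:
            client.remove(key)  # one delete per row: the inefficiency noted above
            deleted += 1
        cursor = keys[-1] + "\x00"  # resume just past the last key seen
```

Even batched like this, the server still performs one deletion per row, which is why per-period bucketing is cheaper.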

But even if range delete is implemented, it's still quite inefficient, not
really what you want, and it doesn't work with the RandomPartitioner.

If you have some metadata saying who tweeted within a given period (say 10
days or 30 days), and you store the tweets all in the same key per user per
period (say with one column per tweet, or use supercolumns), then you can
just delete one key per user per period.
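A minimal sketch of that bucketing, assuming month-long periods; the key format and helper names are made up for illustration, not anything Cassandra prescribes:

```python
import datetime

def bucket_key(user_id, when):
    """Per-user, per-period row key, e.g. '2010-03:alice'.

    All of a user's tweets for the period live under this one key
    (one column per tweet), so expiring a period is one key deletion.
    """
    return "%04d-%02d:%s" % (when.year, when.month, user_id)

def expired_buckets(user_id, today, keep_months=1):
    """Yield this user's bucket keys older than the retention window."""
    year, month = today.year, today.month
    for _ in range(12):  # look back a year for stale buckets
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        age = (today.year - year) * 12 + (today.month - month)
        if age > keep_months:
            yield "%04d-%02d:%s" % (year, month, user_id)

# Expiry is then one delete per user per stale period:
stale = list(expired_buckets("alice", datetime.date(2010, 3, 12)))
```

Pair this with a small metadata row recording which users posted in each period, so the expiry job only visits users that actually have a bucket to delete.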

One of the problems with using a time-based key with the ordered partitioner is
that you're always going to have a data imbalance, so you may want to try
hashing *part* of the key (the first part) so you can still range scan the
next part. This can fix load balancing while still letting you use range
scans to do data expiry.

e.g. your key is

Hash of day number + user id + timestamp

Then you can range scan the entire day's tweets to expire them, and range
scan a given user's tweets for a given day efficiently (and doing this for
30 days is just 30 range scans).

Putting a hash in there fixes load balancing with OPP.
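That composite key can be sketched as follows; the 8-character MD5 prefix and the ':' separator are arbitrary choices for illustration:

```python
import hashlib

def day_prefix(day_number):
    """Short hash of the day number: spreads each day's rows around the
    ring under the order-preserving partitioner, and doubles as the
    range-scan prefix covering every tweet of that day."""
    return hashlib.md5(str(day_number).encode()).hexdigest()[:8] + ":"

def tweet_key(day_number, user_id, timestamp):
    """Key = hash(day) + user id + zero-padded timestamp, so within a
    day the keys still sort by user and then by time."""
    return "%s%s:%010d" % (day_prefix(day_number), user_id, timestamp)

# Expiring 30 days of data is 30 prefix range scans, one per day_prefix;
# scanning one user's tweets for one day uses the longer prefix
# day_prefix(d) + user_id + ":".
```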

Mark