Posted to user@cassandra.apache.org by mike dooley <do...@apple.com> on 2011/01/14 07:38:59 UTC

limiting columns in a row

hi,

the time-to-live feature in 0.7 is very nice and it made me want to ask about
a somewhat similar feature.  

i have a stream of data consisting of entities and associated samples, so i create
a row for each entity, and the columns in each row contain the samples for that entity.
when i get around to processing an entity i only care about the most recent N samples,
so i read the most recent N columns and delete all the rest.
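(for illustration, a rough sketch of the read-then-delete pattern described above; the
column layout is a stand-in, not a real cassandra client call — samples are modeled as
(timestamp, value) pairs sorted newest first, the way a reversed slice would return them:)

```python
# sketch of the client-side trim: keep the most recent n samples
# (columns) for an entity (row), and collect the rest for deletion.
# 'columns' stands in for a row's columns read newest-first.

def trim_row(columns, n):
    """Split columns into (kept, to_delete).

    columns: list of (timestamp, value) pairs, newest first.
    Returns the most recent n columns and everything older.
    """
    kept = columns[:n]       # the n most recent samples
    to_delete = columns[n:]  # older samples the client must delete itself
    return kept, to_delete
```

this is exactly the work the proposed 'max columns per row' property would push
down into the server, so the client never has to issue the deletes at all.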

what i would like would be a column family property that allows me to
specify a maximum number of columns per row.  then i could just keep writing
and not have to do the deletes.

in my case it would be fine if the limit is only 'eventually' applied (so that
sometimes there might be extra columns).

does this seem like a generally useful feature?  if so, would it be hard to
implement (maybe it could be done at compaction time like the TTL feature)?

thanks,
-mike

Re: limiting columns in a row

Posted by Sylvain Lebresne <sy...@riptano.com>.
Hi,

> does this seem like a generally useful feature?

I do think this could be a useful feature, if only because I don't think there
is any satisfactory/efficient way to do this client-side.

> if so, would it be hard to implement (maybe it could be done at compaction
> time like the TTL feature)?

Off the top of my head (aka, I haven't really thought this through, but I'll
still give my opinion), I see the following difficulties:
  1) You can only do this limiting during major compaction, or in the same
     cases as CASSANDRA-1074 for minor compactions, since you need to make
     sure the x columns you are keeping are not deleted ones. Or you'll want
     to disable deletes altogether on the cf with this 'limit' option (I feel
     like this last option would really simplify things).
  2) Even if the removal of the columns exceeding the limit is eventual (and
     it will be), you'll want queries to only ever return columns inside the
     limit (otherwise the feature would be too unpredictable). But I think
     this will be quite challenging. That is, slice queries from the start of
     the row are easy. Everything else is harder (at least if you want to
     make it efficient).
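As a rough sketch of what the compaction-time capping in point 1 might look
like (simplified: real compaction would also have to retain tombstones until
gc_grace expires, and the column/tombstone model here is a stand-in, not
Cassandra's actual internal types):

```python
# sketch of compaction-time capping: given all of a row's columns merged
# across sstables (newest first), write out at most 'limit' live columns.
# tombstoned columns must not count toward the limit, which is why this
# only works safely when the whole row is visible, i.e. a major compaction.

def cap_row(merged_columns, limit):
    """merged_columns: list of (timestamp, value, is_tombstone), newest first.
    Returns up to 'limit' live (timestamp, value) columns to keep."""
    kept = []
    for ts, value, is_tombstone in merged_columns:
        if is_tombstone:
            continue  # a deleted column must not eat a slot in the limit
        kept.append((ts, value))
        if len(kept) == limit:
            break  # everything older is dropped, as if it had been deleted
    return kept
```

A minor compaction that sees only some of the row's sstables cannot do this,
because a column it keeps may be shadowed by a tombstone in an sstable it
never read — the same visibility problem as CASSANDRA-1074.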

That was my 2 cents. Anyway, you can always open a JIRA ticket.

--
Sylvain

