
Posted to user@cassandra.apache.org by aaron morton <aa...@thelastpickle.com> on 2012/01/02 22:24:20 UTC

Re: Row or Supercolumn with approximately n columns

Even if compaction enforced a limit on the number of columns in a row, there would still be issues with concurrent writes and with read repair. For example, node A may say these are the first n columns while node B says something else, and you only know which one is correct at read time.

Have you considered using a TTL on the columns?

Depending on the use case, you could also consider having writes periodically or randomly trim the data size, or trimming on reads.
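
A minimal sketch of the random-trim idea, assuming an in-memory stand-in for a single row rather than a real Cassandra client. The class name TrimmedRow and its parameters are hypothetical and only illustrate the control flow:

```python
import random
from bisect import insort


class TrimmedRow:
    """In-memory stand-in for one wide row (hypothetical, not a Cassandra API)."""

    def __init__(self, max_columns, trim_probability=0.1, rng=None):
        self.max_columns = max_columns
        self.trim_probability = trim_probability
        self.rng = rng if rng is not None else random.Random()
        self.columns = []  # (timestamp, value) pairs kept sorted by timestamp

    def insert(self, timestamp, value):
        insort(self.columns, (timestamp, value))
        # Only a fraction of writes pay the trimming cost, so the row can
        # temporarily hold more than max_columns columns.
        if self.rng.random() < self.trim_probability:
            self.trim()

    def trim(self):
        excess = len(self.columns) - self.max_columns
        if excess > 0:
            del self.columns[:excess]  # drop the oldest columns

    def read_latest(self):
        # Trim on read: the caller never sees more than max_columns columns.
        self.trim()
        return list(self.columns)
```

With a low trim_probability the limit is only approximate between trims, which matches the "approximately n columns" requirement in the original question.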

It would also make sense to partition the time series data into different rows, and Viva la Standard Column Families!
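
One way to partition a time series into different rows is to put a time bucket in the row key. The sketch below is only an illustration: the function name, the "series:bucket" key layout and the bucket sizes are assumptions, not something prescribed by Cassandra.

```python
from datetime import datetime, timezone


def bucketed_row_key(series_id, epoch_seconds, bucket="day"):
    """Build a row key such as 'sensor42:20120102' (hypothetical layout)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    formats = {"hour": "%Y%m%d%H", "day": "%Y%m%d", "month": "%Y%m"}
    return "%s:%s" % (series_id, ts.strftime(formats[bucket]))
```

Each bucket then becomes its own row in a standard column family, so stale data can be dropped a whole row at a time instead of column by column.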

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/12/2011, at 7:48 PM, Praveen Baratam wrote:

> Hello Everybody,
> 
> Happy Christmas.
> 
> I know that this topic has come up quite a few times on the Dev and User lists but did not culminate in a solution.
> 
> http://www.mail-archive.com/user@cassandra.apache.org/msg15367.html
> 
> The above discussion on the User list talks about AbstractCompactionStrategy, but I could not find any relevant documentation as it's a fairly new feature in Cassandra.
> 
> Let me state this necessity and use-case again.
> 
> I need a ColumnFamily (CF) wide or SuperColumn (SC) wide option to approximately limit the number of columns to "n". "n" can vary a lot, and the intention is to throw away stale data, not to maintain a hard limit on the CF or SC. It's very useful for storing time-series data where stale data is not necessary. The goal is to achieve this with minimum overhead, and since compaction happens all the time it would be clever to implement it as part of compaction.
> 
> Thanks in advance.
> 
> Praveen

Re: Row or Supercolumn with approximately n columns

Posted by Praveen Baratam <pr...@gmail.com>.

I understand that there will be contention regarding which *n* columns are
the current *n* columns, but as mentioned previously the goal is to limit
the accumulation of data, as in our use-case some row keys can receive
fairly heavy inserts. For people who require a precise set of current
columns, that can be implemented by keeping a buffer of *m* columns above
the *n* columns so that they can filter in the client.
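
The client-side filtering could look roughly like the sketch below. It assumes the row is returned as (timestamp, value) pairs and may hold up to n + m columns; the function name is made up for illustration.

```python
import heapq


def current_columns(columns, n):
    """Return the n newest (timestamp, value) pairs, oldest first."""
    newest = heapq.nlargest(n, columns, key=lambda column: column[0])
    return sorted(newest, key=lambda column: column[0])
```

The server only has to keep the row near n + m columns, while the client decides precisely which n are current.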

I believe this approach will not tax Cassandra in terms of performance.

Coming to TTL-based columns, it's difficult to store the last *n* samples
with this approach. If the inserts happen at a constant/predictable rate
then we can achieve the desired functionality using TTL, but if inserts are
event driven, there is no guarantee that the last *n* samples are still
visible once the TTL has passed. This may not be desirable in many
use-cases including mine.
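
A small simulation of the difference, under the simplifying assumption that a column is visible while its age is below the TTL. Both function names and the sample numbers are illustrative only.

```python
def visible_with_ttl(insert_times, now, ttl):
    """Columns still live at time 'now' when each was written with the same TTL."""
    return [t for t in insert_times if now - t < ttl]


def visible_last_n(insert_times, n):
    """Columns kept when the row is limited to its n newest entries."""
    return sorted(insert_times)[-n:]


# Steady inserts every 10 seconds: a 50 second TTL keeps about 5 samples.
steady = list(range(0, 101, 10))
# Event-driven inserts: one burst at the start, then silence.
bursty = [0, 1, 2, 3, 4, 5]
```

At now=100 with ttl=50 the steady series still shows its five newest samples, while the bursty series shows nothing under TTL even though a last-n limit would still return five samples.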

Another approach could be a cron job that reads all the rows and slices
every row to the first *n* columns using batch_mutate. For this to be
efficient we need an efficient way to query for rows with more than *n*
columns. This could be a quick, externally managed compaction if the
performance penalty can be minimized by some internal API provisions.
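
The planning half of such a job could be sketched as follows. It assumes the column names of each row have already been read in comparator order, and it only computes what to delete; submitting the deletions (for example through batch_mutate) is left out, and the function name is hypothetical.

```python
def plan_trim(rows, n):
    """Map each oversized row key to the column names beyond the first n.

    rows: dict of row key -> list of column names in comparator order.
    Rows with n or fewer columns are left out of the plan entirely.
    """
    return {key: names[n:] for key, names in rows.items() if len(names) > n}
```

Without a cheap way to find rows holding more than *n* columns, the job still has to scan every row, which is the inefficiency described above.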

https://issues.apache.org/jira/browse/CASSANDRA-3678?page=com.atlassian.streams.streams-jira-plugin:activity-stream-issue-tab#issue-tabs

I have also opened the above ticket to collect ideas to solve this problem.
Sadly no activity yet.

Coming to custom compaction for this purpose, a leveled compaction with
only two levels, or just one, could be enough, as rows are not meant to
grow huge and most rows have a similar number of similarly sized columns.

Regards.

On Tue, Jan 3, 2012 at 4:29 AM, aaron morton <aa...@thelastpickle.com> wrote:

> They are removed during compaction, both automatic (minor) and manual
> (major).
>
> The performance drop comes from having a lot of expired columns that have
> not yet been purged by compaction, as they must be read and discarded
> during reads.
>
> Cheers
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 3/01/2012, at 10:38 AM, R. Verlangen wrote:
>
> @Aaron: Small side question, when do columns with a past TTL get removed?
> On a repair, (minor) compaction, or .. ? Does it have a performance drop if
> that's happening?

Re: Row or Supercolumn with approximately n columns

Posted by aaron morton <aa...@thelastpickle.com>.

They are removed during compaction, both automatic (minor) and manual (major).

The performance drop comes from having a lot of expired columns that have not yet been purged by compaction, as they must be read and discarded during reads.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/01/2012, at 10:38 AM, R. Verlangen wrote:

> @Aaron: Small side question, when do columns with a past TTL get removed? On a repair, (minor) compaction, or .. ? Does it have a performance drop if that's happening?

Re: Row or Supercolumn with approximately n columns

Posted by "R. Verlangen" <ro...@us2.nl>.

@Aaron: Small side question, when do columns with a past TTL get removed?
On a repair, (minor) compaction, or .. ? Does it have a performance drop if
that's happening?
