Posted to user@cassandra.apache.org by Edward Capriolo <ed...@gmail.com> on 2014/02/04 16:50:47 UTC

Re: Ultra wide row anti pattern

I have actually been building something similar in my spare time. You can
hang around and wait for it or build your own. Here are the basics. Not
perfect, but it will work.

Create a column family "queue" with gc_grace_seconds set to one day (86400)

set queue [timeuuid()] ["z"+timeuuid()] = [work to do]

The producer can decide how it wants to roll over the row key and the
column key; it does not matter.
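Roughly the same thing in CQL3 terms, as a sketch (the table and names are
illustrative, not a final design; the 'z-' suffix would be a timeuuid
rendered as text by the client):

CREATE TABLE queue (
    bucket timeuuid,   -- the "row key"; roll it over however you like
    seq text,          -- 'a-<consumer id>' bids sort before 'z-<timeuuid>' work
    payload text,
    PRIMARY KEY (bucket, seq)
) WITH gc_grace_seconds = 86400;

-- producer enqueues work under the current bucket
INSERT INTO queue (bucket, seq, payload)
VALUES (now(), 'z-0b63f720-8da6-11e3-baa8-0800200c9a66', 'work to do');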

Supposing there are N consumers, we need a way for the consumers not to do
the same work. We can use something like the bakery algorithm. Remember
that at QUORUM a reader sees writes.

A consumer needs an identifier (it could be another uuid or an IP address).
A consumer calls get_range_slices on the queue; the slice is from new
byte[0] to new byte[0], limit 100.

The consumer sees data like this:

[1234] [z-$timeuuid] = data

Now we register that this consumer wants to consume this queue:

set [1234] [a-${ip}] at QUORUM

Now we do a slice:
get_slice [1234] from new byte[0] to 'b'

There are a few possible returns.
1) One bidder:
[1234] [a-$myip]
You won; start consuming.

2) Two bidders:
[1234] [a-$myip]
[1234] [a-$otherip]
Compare $myip vs $otherip; the higher value wins.

Whoever wins can then start consuming the columns in the queue and delete
them when done.
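
The consumer side, in the same CQL sketch (again illustrative; run
everything at QUORUM, e.g. with CONSISTENCY QUORUM in cqlsh):

-- 1) find a candidate bucket (partitions come back in token order,
--    so this is "some bucket", not "the oldest bucket")
SELECT DISTINCT bucket FROM queue LIMIT 100;

-- 2) bid for the bucket with our identifier
INSERT INTO queue (bucket, seq, payload) VALUES (?, 'a-10.20.30.40', '');

-- 3) read back the bids; all 'a-' columns sort before 'b'
SELECT seq FROM queue WHERE bucket = ? AND seq < 'b';

-- 4) if the highest 'a-' entry is ours, we won: consume and delete
SELECT seq, payload FROM queue WHERE bucket = ? AND seq >= 'z' LIMIT 100;
DELETE FROM queue WHERE bucket = ? AND seq = ?;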

On Friday, January 31, 2014, DuyHai Doan <do...@gmail.com> wrote:
> Thanks Nate for your ideas.
>>This could be as simple as adding year and month to the primary key (in
the form 'yyyymm'). Alternatively, you could add this in the partition key
definition. Either way, it then becomes pretty easy to re-generate these
based on the query parameters.
>
>  The thing is that it's not that simple. My customer has a very BAD idea:
using Cassandra as a queue (the classic anti-pattern).
>  Before trying to tell them to redesign their entire architecture and put
in some queueing system like ActiveMQ or something similar, I would like to
see how I can use wide rows to meet the requirements.
>  The functional need is quite simple:
>  1) A process A loads users into Cassandra and sets the status on each
user to 'TODO'. When using the bucketing technique, we can limit a row
width to, let's say, 100 000 columns. So at the end of the current row,
process A knows that it should move to the next bucket. The bucket is coded
using a composite partition key; in our example it would be 'TODO:1',
'TODO:2', etc.
>
>  2) A process B reads the wide row for the 'TODO' status. It starts at
bucket 1, so it will read the row with partition key 'TODO:1'. The users are
processed and inserted into a new row, 'PROCESSED:1' for example, to keep
track of the status. After retrieving 100 000 columns, it switches
automatically to the next bucket. Simple. Fair enough.
>
>  3) Now what sucks is that sometimes process B does not have enough data
to perform the functional logic on the users it fetched from the wide row,
so it has to RE-PUT some users back into the 'TODO' status rather than
transitioning them to 'PROCESSED'. That's exactly queue behavior.
>  A simplistic idea would be to insert those m users again with 'TODO:n',
with n higher than the current bucket number, so they can be processed
later. But then it breaks the whole counting system. Process A, which
inserts data, will not know that there are already m users in row n, so it
will happily add 100 000 columns, making the row size grow to 100 000 + m.
When process B reads this row back, it will stop at the first 100 000
columns and skip the trailing m elements.
>   That's the main reason I dropped the idea of bucketing (which is quite
smart in the normal case) in favor of an ultra wide row; the bucketed
layout is sketched below for reference.
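> For reference, the bucketed layout looks like this in CQL (names are
illustrative):
>
> CREATE TABLE user_status (
>     status_bucket text,   -- status + bucket number, e.g. 'TODO:1'
>     user_id timeuuid,
>     user_data text,
>     PRIMARY KEY (status_bucket, user_id)
> );
>
> -- process A fills 'TODO:1' up to ~100 000 rows, then moves to 'TODO:2'
> INSERT INTO user_status (status_bucket, user_id, user_data)
> VALUES ('TODO:1', now(), '...');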
>  Anyway, I'll follow your advice and play around with the parameters of
SizeTiered.
>  Regards
>  Duy Hai DOAN
>
> On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall <na...@thelastpickle.com>
wrote:
>>>
>>>  The only drawback for an ultra wide row I can see is point 1). But if I
use leveled compaction with a sufficiently large value for
"sstable_size_in_mb" (let's say 200 MB), will my read performance be
impacted as the row grows?
>>
>> For this use case, you would want to use SizeTieredCompaction and play
around with the configuration a bit to keep a small number of large
SSTables. Specifically: keep min|max_threshold really low, set bucket_low
and bucket_high closer together (maybe even both to 1.0), and maybe use a
larger min_sstable_size.
>> YMMV though - per Rob's suggestion, take the time to run some tests
tweaking these options.
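>> Something along these lines, as a starting point (values are
illustrative, not recommendations):
>>
>> ALTER TABLE queue WITH compaction = {
>>     'class': 'SizeTieredCompactionStrategy',
>>     'min_threshold': '2',
>>     'max_threshold': '4',
>>     'bucket_low': '1.0',
>>     'bucket_high': '1.0',
>>     'min_sstable_size': '104857600'  -- 100 MB, in bytes
>> };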
>>
>>>
>>>  Of course, splitting a wide row into several rows using the bucketing
technique is one solution, but it forces us to keep track of the bucket
number, and that's not convenient. We have one process (JVM) that inserts
data and another process (JVM) that reads data. Using bucketing, we need to
synchronize the bucket number between the 2 processes.
>>
>> This could be as simple as adding year and month to the primary key (in
the form 'yyyymm'). Alternatively, you could add this in the partition key
definition. Either way, it then becomes pretty easy to re-generate these
based on the query parameters.
>>
>>
>> --
>> -----------------
>> Nate McCall
>> Austin, TX
>> @zznate
>>
>> Co-Founder & Sr. Technical Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>

Re: Ultra wide row anti pattern

Posted by Edward Capriolo <ed...@gmail.com>.
You could use another column with CAS as a management layer. You only have
to consult it when picking up new rows.
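
For example, with C* 2.0 lightweight transactions (names illustrative):

-- a separate management table; only one consumer's claim can succeed,
-- and the rest see [applied] = False and move on
CREATE TABLE queue_owner (
    bucket timeuuid PRIMARY KEY,
    owner text
);

INSERT INTO queue_owner (bucket, owner)
VALUES (?, '10.20.30.40')
IF NOT EXISTS;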


On Tue, Feb 4, 2014 at 3:45 PM, DuyHai Doan <do...@gmail.com> wrote:

> Great idea for implementing the queue pattern. Thank you Edward.
>
> However with your design there are still corner cases where 2 consumers
> read from the same queue. Reading and writing at QUORUM does not prevent
> race conditions. I believe the new CAS feature of C* 2.0 might be useful
> here, but at the expense of reduced throughput (because of the Paxos round).
>
> On Tue, Feb 4, 2014 at 4:50 PM, Edward Capriolo <ed...@gmail.com> wrote:
>
>> [snip]

Re: Ultra wide row anti pattern

Posted by DuyHai Doan <do...@gmail.com>.
Great idea for implementing the queue pattern. Thank you Edward.

However with your design there are still corner cases where 2 consumers
read from the same queue. Reading and writing at QUORUM does not prevent
race conditions. I believe the new CAS feature of C* 2.0 might be useful
here, but at the expense of reduced throughput (because of the Paxos round).




On Tue, Feb 4, 2014 at 4:50 PM, Edward Capriolo <ed...@gmail.com> wrote:

> [snip]

Re: Ultra wide row anti pattern

Posted by Edward Capriolo <ed...@gmail.com>.
Generally you need to make a wide row because the row keys in Cassandra are
ordered by their md5/murmur code. As a result you have no way of locating
"new" rows, but if the row name is predictable, the columns inside the row
are ordered.
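
In CQL terms (illustrative, reusing the queue sketch from the first
message in this thread):

-- partition keys are hash-ordered, so you can only page by token;
-- there is no "give me rows newer than X" across partitions:
SELECT DISTINCT bucket FROM queue WHERE token(bucket) > token(?) LIMIT 100;

-- inside one partition the clustering column is sorted, so ranges work:
SELECT seq, payload FROM queue WHERE bucket = ? AND seq >= 'z' LIMIT 100;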


On Tue, Feb 4, 2014 at 12:02 PM, Yogi Nerella <yn...@gmail.com> wrote:

> Sorry, I am not understanding the problem; I am new to Cassandra and want
> to understand this issue.
>
> Why do we need to use a wide row for this situation? Why not a simple
> table in Cassandra?
>
> todolist (user, state)   ==> is there any other information in this table
> that is needed for processing the todo?
> processedlist (user, state)
>
>
>
> On Tue, Feb 4, 2014 at 7:50 AM, Edward Capriolo <ed...@gmail.com> wrote:
>
>> [snip]

Re: Ultra wide row anti pattern

Posted by Yogi Nerella <yn...@gmail.com>.
Sorry, I am not understanding the problem; I am new to Cassandra and want
to understand this issue.

Why do we need to use a wide row for this situation? Why not a simple table
in Cassandra?

todolist (user, state)   ==> is there any other information in this table
that is needed for processing the todo?
processedlist (user, state)
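
That is, something like (illustrative):

CREATE TABLE todolist (
    user text PRIMARY KEY,
    state text
);

CREATE TABLE processedlist (
    user text PRIMARY KEY,
    state text
);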



On Tue, Feb 4, 2014 at 7:50 AM, Edward Capriolo <ed...@gmail.com> wrote:

> [snip]