Posted to user@cassandra.apache.org by Ivan Chang <iv...@medigy.com> on 2009/07/17 16:14:28 UTC

Concurrent updates

I have the following scenario and would like to find the best solution for it.

Here's the scenario:

Table1.Standard1['cassandra']['frequency']

It is used for keeping track of how many times the word "cassandra"
has appeared.

Let's say we have a bunch of articles stored in Hadoop. A Map/Reduce job
greps all articles throughout the Hadoop cluster for matches of the pattern
^cassandra$
and updates Table1.Standard1['cassandra']['frequency'].  Hence
Table1.Standard1['cassandra']['frequency'] will be updated concurrently.

One of the issues I am facing is that
Table1.Standard1['cassandra']['frequency']
stores the count as a String (I am using Java). In order to update the
frequency properly, the thread that's running the Map/Reduce job has to
retrieve Table1.Standard1['cassandra']['frequency'] in its native String
format, hold that in temp (a Java String), convert it into an int, add the
new counts, and finally issue
"SET Table1.Standard1['cassandra']['frequency']. =  '" + temp.toString() +
"'"

During the entire process, how do we guarantee correctness under
concurrent updates?  The CQL SET does
not allow something like

SET Table1.Standard1['cassandra']['frequency']. =
Table1.Standard1['cassandra']['frequency']. + newCounts

since there's only one String type.
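
To make the concern concrete, here is a minimal Java sketch of the
read / convert / add / write sequence described above, with two workers
interleaved so that both read before either writes. The class
LostUpdateDemo and its in-memory map are hypothetical stand-ins for the
column store, not a Cassandra client or CQL call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the column store; not a Cassandra client.
class LostUpdateDemo {
    // Column name -> value; values are Strings, as in the scenario.
    private final Map<String, String> columns = new HashMap<>();

    LostUpdateDemo(String initial) {
        columns.put("frequency", initial);
    }

    String read() {
        return columns.get("frequency");
    }

    void write(String value) {
        columns.put("frequency", value);
    }

    // Replays one bad interleaving: both workers read before either one writes.
    static String interleavedUpdate(String initial, int countA, int countB) {
        LostUpdateDemo store = new LostUpdateDemo(initial);
        String tempA = store.read();   // worker A reads the current value
        String tempB = store.read();   // worker B reads the same value
        store.write(String.valueOf(Integer.parseInt(tempA) + countA)); // A writes
        store.write(String.valueOf(Integer.parseInt(tempB) + countB)); // B overwrites A
        return store.read();
    }

    public static void main(String[] args) {
        // The intended result is 10 + 3 + 4 = 17, but worker A's update is lost.
        System.out.println(interleavedUpdate("10", 3, 4)); // prints 14
    }
}
```

The second write is computed from a stale read, so the first worker's
counts disappear without any error being raised.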

What would be the best solution in this situation?

Thanks,
Ivan

Re: Concurrent updates

Posted by Michael Greene <mi...@gmail.com>.

Even if CQL SET allowed for the operation you're describing, it's at
odds with the availability and consistency constraints of Cassandra.
Another process, somewhere else, could be reading and writing that
frequency value at the same time.  Reducing the operation to one
statement does not make it transactional or idempotent.

Unless you are looking for estimates in that cell and the delay
between processing updates to that cell is large enough to provide
reasonable estimates, you will want to look at a queueing solution or
a transaction solution outside of Cassandra.  There are a few issues
open in JIRA that would allow you to raise the consistency level on this
particular read/write call to ensure that you are getting better
estimates, but this is a scenario that Cassandra does not handle well.

If you can think of a way to model your operation to be idempotent,
then that would be preferable.  Otherwise an external queue (such as
an AMQP broker) or transaction system (such as ZooKeeper) is all I can
think of at the moment.

Michael
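
The queueing idea above could look roughly like the following Java
sketch. It is only an illustration under stated assumptions: a
LinkedBlockingQueue stands in for the external queue, the frequency
field stands in for the Cassandra column, and the class
SingleWriterCounter is hypothetical. The point is that mappers only
enqueue their counts, and a single writer performs every
read-modify-write.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: the queue stands in for an external message queue and
// the `frequency` field stands in for the stored column.
class SingleWriterCounter {
    private final BlockingQueue<Integer> increments = new LinkedBlockingQueue<>();
    private String frequency = "0"; // touched only by the single writer

    // Called by any number of mappers concurrently; they never read the counter.
    void submit(int count) {
        increments.add(count);
    }

    // Called by the single writer: applies queued increments one at a time.
    String drain() {
        Integer next;
        while ((next = increments.poll()) != null) {
            frequency = String.valueOf(Integer.parseInt(frequency) + next);
        }
        return frequency;
    }

    // Starts `mappers` threads that each submit one count, then drains the queue.
    static String run(int mappers, int countPerMapper) {
        SingleWriterCounter counter = new SingleWriterCounter();
        Thread[] threads = new Thread[mappers];
        for (int i = 0; i < mappers; i++) {
            threads[i] = new Thread(() -> counter.submit(countPerMapper));
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join(); // wait until every mapper has enqueued its count
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for mappers", e);
        }
        return counter.drain();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 5)); // 8 mappers, 5 occurrences each
    }
}
```

Because only one thread ever reads and rewrites the value, no update is
computed from a stale read. The cost is that the writer becomes a
serialization point.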

Re: Concurrent updates

Posted by Sandeep Tata <sa...@gmail.com>.

You could (for now) store per-mapper counters in
Table1.Standard1['cassandra']['frequency-mapperid'].
At the end, you do a get_slice and add them up.
This is really bad for fault tolerance -- you'll get wrong counts if
mappers are restarted because of failures. But then, you'd have the
same problem if you (transactionally) incremented a single counter
too.
This way, modulo failures, your answer is still correct.
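
A minimal Java sketch of this per-mapper layout follows. The class
PerMapperCounters and its TreeMap are hypothetical in-memory stand-ins
for one row of the column family, and total() only imitates what a
get_slice followed by a client-side sum would do; none of this is a real
Cassandra client call.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for one row of a column family.
class PerMapperCounters {
    private static final String PREFIX = "frequency-";

    // Column name -> value, sorted by column name.
    private final TreeMap<String, String> row = new TreeMap<>();

    // Each mapper writes only its own column, so no two writers share a cell.
    void writeCount(String mapperId, int count) {
        row.put(PREFIX + mapperId, String.valueOf(count));
    }

    // Stand-in for the final slice read: sum every "frequency-*" column.
    long total() {
        long sum = 0;
        for (Map.Entry<String, String> column : row.entrySet()) {
            if (column.getKey().startsWith(PREFIX)) {
                sum += Long.parseLong(column.getValue());
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        PerMapperCounters counters = new PerMapperCounters();
        counters.writeCount("m1", 3);
        counters.writeCount("m2", 4);
        counters.writeCount("m3", 5);
        System.out.println(counters.total()); // prints 12
    }
}
```

No mapper ever reads a value written by another mapper, so there is no
read-modify-write race; the addition happens once, at read time.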




Re: Concurrent updates

Posted by Jonathan Ellis <jb...@gmail.com>.

This is the kind of inconsistency that vector clocks can handle but
the more simplistic timestamp-based resolution cannot.

Of test-and-set vs. vector clocks, vector clocks fit Cassandra much better.

-Jonathan
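
As a rough illustration of the difference, the Java sketch below
compares two vector clocks. With timestamps, one of two conflicting
writes simply wins; with vector clocks, two writes that each advanced
from the same starting version are reported as concurrent, so the
conflict is at least visible and can be reconciled. The class
VectorClockSketch, the node names, and the map representation are all
assumptions made for this example; this is not how Cassandra stores or
resolves versions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of vector clock comparison; not Cassandra code.
class VectorClockSketch {
    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    // Compares clock a to clock b; a missing node entry counts as 0.
    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        boolean aAhead = false;
        boolean bAhead = false;
        Set<String> nodes = new HashSet<>(a.keySet());
        nodes.addAll(b.keySet());
        for (String node : nodes) {
            int countA = a.getOrDefault(node, 0);
            int countB = b.getOrDefault(node, 0);
            if (countA > countB) aAhead = true;
            if (countB > countA) bAhead = true;
        }
        if (aAhead && bAhead) return Order.CONCURRENT; // neither saw the other
        if (aAhead) return Order.AFTER;
        if (bAhead) return Order.BEFORE;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        Map<String, Integer> base = new HashMap<>();
        base.put("n1", 1);

        // Two updates made independently from the same base version.
        Map<String, Integer> viaN1 = new HashMap<>(base);
        viaN1.merge("n1", 1, Integer::sum); // {n1=2}
        Map<String, Integer> viaN2 = new HashMap<>(base);
        viaN2.merge("n2", 1, Integer::sum); // {n1=1, n2=1}

        System.out.println(compare(base, viaN1));  // prints BEFORE
        System.out.println(compare(viaN1, viaN2)); // prints CONCURRENT
    }
}
```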


Re: Concurrent updates

Posted by Jun Rao <ju...@almaden.ibm.com>.

This is a case where a test-and-set feature would be useful. See the
following JIRA. We just don't have it nailed down yet.
https://issues.apache.org/jira/browse/CASSANDRA-48

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com
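
For readers unfamiliar with the idea, a test-and-set update is a retry
loop: read the value, compute the new one, and write it only if the
stored value is still the one that was read. The Java sketch below shows
the pattern with java.util.concurrent.atomic.AtomicReference standing in
for the stored column. The class TestAndSetSketch is hypothetical, and
this is not the interface discussed in CASSANDRA-48, which was not
settled at the time of this thread.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a test-and-set retry loop; AtomicReference stands in
// for the stored column. This is not a Cassandra API.
class TestAndSetSketch {
    private final AtomicReference<String> frequency = new AtomicReference<>("0");

    // Adds newCounts, retrying whenever another writer got in first.
    void add(int newCounts) {
        while (true) {
            String seen = frequency.get();
            String updated = String.valueOf(Integer.parseInt(seen) + newCounts);
            // compareAndSet compares references; `seen` is the exact object
            // just read, so it succeeds only if nobody has written since.
            if (frequency.compareAndSet(seen, updated)) {
                return;
            }
        }
    }

    String value() {
        return frequency.get();
    }

    // Runs `workers` threads that each call add(1) `addsPerWorker` times.
    static String run(int workers, int addsPerWorker) {
        TestAndSetSketch counter = new TestAndSetSketch();
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < addsPerWorker; j++) {
                    counter.add(1);
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.value();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000)); // prints 4000
    }
}
```

A failed comparison means the value changed between the read and the
write, so the worker re-reads and recomputes instead of overwriting
someone else's update.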



                                                                           