Posted to user@cassandra.apache.org by John Laban <jo...@pagerduty.com> on 2011/12/05 20:40:27 UTC

best practices for simulating transactions in Cassandra

Hello,

I'm building a system using Cassandra as a datastore, and I have a few
places where I'm in need of transactions.

I'm using ZooKeeper to provide locking when I'm in need of some concurrency
control or isolation, so that solves that half of the puzzle.

What I need now is to sometimes be able to get atomicity across multiple
writes by simulating the "begin/rollback/commit" abilities of a relational
DB.  In other words, there are places where I need to perform multiple
updates/inserts, and if I fail partway through, I would ideally be able to
roll back the partially-applied updates.

Now, I *know* this isn't possible with Cassandra.  What I'm looking for are
all the best practices, or at least tips and tricks, so that I can get
around this limitation in Cassandra and still maintain a consistent
datastore.  (I am using quorum reads/writes so that eventual consistency
doesn't kick my ass here as well.)
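
(For reference, that quorum setup looks roughly like the following with a
pycassa-style client; the keyspace/CF names are placeholders and the exact
import path may vary by version.)

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel

    # Reads and writes both at QUORUM, so a successful quorum write is
    # guaranteed to be seen by any later quorum read.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    users = pycassa.ColumnFamily(pool, 'Users',
                                 read_consistency_level=ConsistencyLevel.QUORUM,
                                 write_consistency_level=ConsistencyLevel.QUORUM)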

Below are some ideas I've been able to dig up.  Please let me know if any
of them don't make sense, or if there are better approaches:


1) Updates to a row in a column family are atomic.  So try to model your
data so that you would only ever need to update a single row in a single CF
at once.  Essentially, you model your data around transactions.  This is
tricky but can certainly be done in some situations.
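
For example, a rough sketch of what I mean (pycassa-style, with a
hypothetical 'Orders' CF; untested):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')
    orders = pycassa.ColumnFamily(pool, 'Orders')

    # The order header and its line items all live in one row of one CF, so
    # the whole change is a single row mutation: it either lands or it doesn't.
    orders.insert('order:1234', {
        'status': 'placed',
        'item:1': 'widget x2',
        'item:2': 'gadget x1',
    })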

2) If you are only dealing with multiple row *inserts* (and not updates),
have one of the rows act as a 'commit' by essentially validating the
presence of the other rows.  For example, say you were performing an
operation where you wanted to create an Account row and 5 User rows all at
once (this is an unlikely example, but bear with me).  You could insert 5
rows into the Users CF, and then the 1 row into the Accounts CF, which acts
as the commit.  If something went wrong before the Account could be
created, any Users that had been created so far would be orphaned and
unusable, as your business logic can ensure that they can't exist without
an Account.  You could also have an offline cleanup process that swept away
orphans.
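
A rough sketch of that pattern (pycassa-style, names made up, untested):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')
    users = pycassa.ColumnFamily(pool, 'Users')
    accounts = pycassa.ColumnFamily(pool, 'Accounts')

    def create_account_with_users(account_id, user_ids):
        # Write the dependent rows first; they stay unusable because reads
        # always go through the Account row.
        for uid in user_ids:
            users.insert(uid, {'account_id': account_id})
        # This last write is the "commit".  If we crash before it, the User
        # rows above are orphans for the offline sweeper to clean up.
        accounts.insert(account_id, {'user_count': str(len(user_ids))})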

3) Try to model your updates as idempotent column inserts instead.  How do
you model updates as inserts?  Instead of munging the value directly, you
could insert a column containing the operation you want to perform (like
"+5").  It would work kind of like the Consistent Vote Counting
implementation: ( https://gist.github.com/416666 ).  How do you make the
inserts idempotent?  Make sure the column names correspond to a request ID
or some other identifier that would be identical across re-drives of a
given (perhaps originally failed) request.  This could leave your datastore
in a temporarily inconsistent state, but would eventually become consistent
after a successful re-drive of the original request.
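
Sketched out (pycassa-style, untested), with the request ID as the column
name and the operation as the value:

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace')
    balance_ops = pycassa.ColumnFamily(pool, 'BalanceOps')

    def apply_delta(account_id, request_id, delta):
        # Re-driving the same request just rewrites the same column, so the
        # operation is applied at most once however many retries happen.
        balance_ops.insert(account_id, {request_id: '%+d' % delta})

    def current_balance(account_id):
        # Fold the operation columns into a value (sketch only: ignores
        # paging past the default column count).
        ops = balance_ops.get(account_id)
        return sum(int(v) for v in ops.values())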

4) You could take an approach like Dominic Williams proposed with Cages:
http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
  The gist is that you snapshot all the original values that you're about
to munge somewhere else (in his case, ZooKeeper), make your updates, and
then delete the snapshot (and that delete needs to be atomic).  If the
snapshot data was never deleted, then subsequent accessors (even readers)
of the data rows need to do the rollback of the previous transaction
themselves before they can read/write this data.  They do the rollback by
just overwriting the current values with what is in the snapshot.  It
offloads the work of the rollback to the next worker that accesses the
data.  This approach probably needs a generic/high-level programming layer
to handle all of the details and complexity, and it doesn't seem like it
was ever added to Cages.
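
Very roughly, the pattern would look like this (pycassa for the data, with
the Python kazoo client standing in for whatever actually stores the
snapshots; all names hypothetical, untested):

    import json
    import pycassa
    from kazoo.client import KazooClient

    pool = pycassa.ConnectionPool('MyKeyspace')
    cf = pycassa.ColumnFamily(pool, 'Accounts')
    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()

    def guarded_update(row_key, new_columns):
        # Assumes the caller already holds the lock for row_key.
        snap_path = '/txn-snapshots/' + row_key
        rollback_if_needed(row_key)                # heal any earlier failure
        before = cf.get(row_key)                   # snapshot current values
        zk.create(snap_path, json.dumps(before).encode('utf-8'), makepath=True)
        cf.insert(row_key, new_columns)            # the "transaction" itself
        zk.delete(snap_path)                       # atomic commit point

    def rollback_if_needed(row_key):
        snap_path = '/txn-snapshots/' + row_key
        if zk.exists(snap_path):
            data, _ = zk.get(snap_path)
            cf.insert(row_key, json.loads(data.decode('utf-8')))
            zk.delete(snap_path)                   # snapshot restored, discard it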


Are there other approaches or best practices that I missed?  I would be
very interested in hearing any opinions from those who have tackled these
problems before.

Thanks!
John

Re: best practices for simulating transactions in Cassandra

Posted by John Laban <jo...@pagerduty.com>.
I'm actually using Curator as a Zookeeper client myself.  I haven't used it
in production yet, but so far it seems well written and Jordan Zimmerman at
Netflix has been great on the support end as well.

I haven't tried Cages so I can't really compare, but I think one of the
main deciding factors between the two is which ZooKeeper recipes you need.

John


On Thu, Dec 15, 2011 at 12:07 AM, Boris Yen <yu...@gmail.com> wrote:

> I am not sure if this is the right thread to ask about this.
>
> I read that some people are using cage+zookeeper. I was wondering if
> anyone evaluates https://github.com/Netflix/curator? this seems to be a
> versatile package.
>
> On Tue, Dec 13, 2011 at 6:06 AM, John Laban <jo...@pagerduty.com> wrote:
>
>> Ok, great.  I'll be sure to look into the virtualization-specific NTP
>> guides.
>>
>> Another benefit of using Cassandra over Zookeeper for locking is that you
>> don't have to worry about losing your connection to Zookeeper (and with it
>> your locks) while hammering away at data in Cassandra.  If using Cassandra
>> for locks, if you lose your locks you lose your connection to the datastore
>> too.   (We're using long-ish session timeouts + connection listeners in ZK
>> to mitigate that now.)
>>
>> John
>>
>>
>>
>> On Mon, Dec 12, 2011 at 12:55 PM, Dominic Williams <
>> dwilliams@fightmymonster.com> wrote:
>>
>>> Hi John,
>>>
>>> On 12 December 2011 19:35, John Laban <jo...@pagerduty.com> wrote:
>>>>
>>>> So I responded to your algorithm in another part of this thread (very
>>>> interesting) but this part of the paper caught my attention:
>>>>
>>>> > When client application code releases a lock, that lock must not
>>>> actually be
>>>> > released for a period equal to one millisecond plus twice the maximum
>>>> possible
>>>> > drift of the clocks in the client computers accessing the Cassandra
>>>> databases
>>>>
>>>> I've been worried about this, and added some arbitrary delay in the
>>>> releasing of my locks.  But I don't like it as it's (A) an arbitrary value
>>>> and (B) it will - perhaps greatly - reduce the throughput of the more
>>>> high-contention areas of my system.
>>>>
>>>> To fix (B) I'll probably just have to try to get rid of locks all
>>>> together in these high-contention areas.
>>>>
>>>> To fix (A), I'd need to know what the maximum possible drift of my
>>>> clocks will be.  How did you determine this?  What value do you use, out of
>>>> curiosity?  What does the network layout of your client machines look like?
>>>>  (Are any of your hosts geographically separated or all running in the same
>>>> DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
>>>> clock skew on an ongoing basis?  Am I worrying too much?
>>>>
>>>
>>> If you setup NTP carefully no machine should drift more than 4ms say. I
>>> forget where, but you'll find the best documentation on how to make a
>>> bullet-proof NTP setup on vendor sites for virtualization software (because
>>> virtualization software can cause drift so NTP setup has to be just so)
>>>
>>> What this means is that, for example, to be really safe when a thread
>>> releases a lock you should wait say 9ms. Some points:-
>>> -- since the sleep is performed before release, an isolated operation
>>> should not be delayed at all
>>> -- only a waiting thread or a thread requesting a lock immediately it is
>>> released will be delayed, and no extra CPU or memory load is involved
>>> -- in practice for the vast majority of "application layer" data
>>> operations this restriction will have no effect on overall performance as
>>> experienced by a user, because such operations nearly always read and write
>>> to data with limited scope, for example the data of two users involved in
>>> some transaction
>>> -- the clocks issue does mean that you can't really serialize access to
>>> more broadly shared data where more than 5 or 10 such requests are made a
>>> second, say, but in reality even if the extra 9ms sleep on release wasn't
>>> necessary, variability in database operation execution time (say under
>>> load, or when something goes wrong) means trouble might occur serializing
>>> with that level of contention
>>>
>>> So in summary, although this drift thing seems bad at first, partly
>>> because it is a new consideration, in practice it's no big deal so long as
>>> you look after your clocks (and the main issue to watch out for is when
>>> application nodes running on virtualization software, hypervisors et al
>>> have setup issues that make their clocks drift under load, and it is a good
>>> idea to be wary of that)
>>>
>>> Best, Dominic
>>>
>>>
>>>> Sorry for all the questions but I'm very concerned about this
>>>> particular problem :)
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>>
>>>> On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <
>>>> dwilliams@fightmymonster.com> wrote:
>>>>
>>>>> Hi guys, just thought I'd chip in...
>>>>>
>>>>> Fight My Monster is still using Cages, which is working fine, but...
>>>>>
>>>>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are
>>>>> 2 main reasons:-
>>>>>
>>>>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we
>>>>> aren't getting anywhere near to capacity and we do a *lot*
>>>>> of serialisation) at some point it will be necessary to start hashing lock
>>>>> paths onto separate ZooKeeper clusters, and I tend to believe that these
>>>>> days you should choose platforms that handle sharding themselves (e.g.
>>>>> choose Cassandra rather than MySQL)
>>>>>
>>>>> 2. Why have more components in your system when you can have less!!!
>>>>> KISS
>>>>>
>>>>> Recently I therefore tried to devise an algorithm which can be used to
>>>>> add a distributed locking layer to clients such as Pelops, Hector, Pycassa
>>>>> etc.
>>>>>
>>>>> There is a doc describing the algorithm, to which may be added an
>>>>> appendix describing a protocol so that locking can be interoperable between
>>>>> the clients. That could be extended to describe a protocol for
>>>>> transactions. Word of warning this is a *beta* algorithm that has only been
>>>>> seen by a select group so far, and therefore not even 100% sure it works
>>>>> but there is a useful general discussion regarding serialization of
>>>>> reads/writes so I include it anyway (and since this algorithm is going to
>>>>> be out there now, if there's anyone out there who fancies doing a Z proof
>>>>> or disproof, that would be fantastic).
>>>>>
>>>>> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>>>>>
>>>>> Final word on this re transactions: if/when transactions are added to
>>>>> locking system in Pelops/Hector/Pycassa, Cassandra will provide better
>>>>> performance than ZooKeeper for storing snapshots, especially as transaction
>>>>> size increases
>>>>>
>>>>> Best, Dominic
>>>>>
>>>>> On 11 December 2011 01:53, Guy Incognito <dn...@gmail.com> wrote:
>>>>>
>>>>>>  you could try writing with the clock of the initial replay entry?
>>>>>>
>>>>>> On 06/12/2011 20:26, John Laban wrote:
>>>>>>
>>>>>> Ah, neat.  It is similar to what was proposed in (4) above with
>>>>>> adding transactions to Cages, but instead of snapshotting the data to be
>>>>>> rolled back (the "before" data), you snapshot the data to be replayed (the
>>>>>> "after" data).  And then later, if you find that the transaction didn't
>>>>>> complete, you just keep replaying the transaction until it takes.
>>>>>>
>>>>>>  The part I don't understand with this approach though:  how do you
>>>>>> ensure that someone else didn't change the data between your initial failed
>>>>>> transaction and the later replaying of the transaction?  You could get lost
>>>>>> writes in that situation.
>>>>>>
>>>>>>  Dominic (in the Cages blog post) explained a workaround with that
>>>>>> for his rollback proposal:  all subsequent readers or writers of that data
>>>>>> would have to check for abandoned transactions and roll them back
>>>>>> themselves before they could read the data.  I don't think this is possible
>>>>>> with the XACT_LOG "replay" approach in these slides though, based on how
>>>>>> the data is indexed (cassandra node token + timeUUID).
>>>>>>
>>>>>>
>>>>>>  PS:  How are you liking Cages?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2011/12/6 Jérémy SEVELLEC <js...@gmail.com>
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>>  I had exactly the same reflexions.
>>>>>>>
>>>>>>>  I'm using zookeeper and cage to lock et isolate.
>>>>>>>
>>>>>>>  but how to rollback?
>>>>>>> It's impossible so try replay!
>>>>>>>
>>>>>>>  the idea is explained in this presentation
>>>>>>> http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
>>>>>>> from slide 24)
>>>>>>>
>>>>>>>  - insert your whole data into one column
>>>>>>> - make the job
>>>>>>> - remove (or expire) your column.
>>>>>>>
>>>>>>>  if there is a problem during "making the job", you keep the
>>>>>>> possibility to replay and replay and replay (synchronously or in a batch).
>>>>>>>
>>>>>>>  Regards
>>>>>>>
>>>>>>>  Jérémy
>>>>>>>
>>>>>>>
>>>>>>> 2011/12/5 John Laban <jo...@pagerduty.com>
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>>  I'm building a system using Cassandra as a datastore and I have a
>>>>>>>> few places where I am need of transactions.
>>>>>>>>
>>>>>>>>  I'm using ZooKeeper to provide locking when I'm in need of some
>>>>>>>> concurrency control or isolation, so that solves that half of the puzzle.
>>>>>>>>
>>>>>>>>  What I need now is to sometimes be able to get atomicity across
>>>>>>>> multiple writes by simulating the "begin/rollback/commit" abilities of a
>>>>>>>> relational DB.  In other words, there are places where I need to perform
>>>>>>>> multiple updates/inserts, and if I fail partway through, I would ideally be
>>>>>>>> able to rollback the partially-applied updates.
>>>>>>>>
>>>>>>>>  Now, I *know* this isn't possible with Cassandra.  What I'm
>>>>>>>> looking for are all the best practices, or at least tips and tricks, so
>>>>>>>> that I can get around this limitation in Cassandra and still maintain a
>>>>>>>> consistent datastore.  (I am using quorum reads/writes so that eventual
>>>>>>>> consistency doesn't kick my ass here as well.)
>>>>>>>>
>>>>>>>>  Below are some ideas I've been able to dig up.  Please let me
>>>>>>>> know if any of them don't make sense, or if there are better approaches:
>>>>>>>>
>>>>>>>>
>>>>>>>>  1) Updates to a row in a column family are atomic.  So try to
>>>>>>>> model your data so that you would only ever need to update a single row in
>>>>>>>> a single CF at once.  Essentially, you model your data around transactions.
>>>>>>>>  This is tricky but can certainly be done in some situations.
>>>>>>>>
>>>>>>>>  2) If you are only dealing with multiple row *inserts* (and not
>>>>>>>> updates), have one of the rows act as a 'commit' by essentially validating
>>>>>>>> the presence of the other rows.  For example, say you were performing an
>>>>>>>> operation where you wanted to create an Account row and 5 User rows all at
>>>>>>>> once (this is an unlikely example, but bear with me).  You could insert 5
>>>>>>>> rows into the Users CF, and then the 1 row into the Accounts CF, which acts
>>>>>>>> as the commit.  If something went wrong before the Account could be
>>>>>>>> created, any Users that had been created so far would be orphaned and
>>>>>>>> unusable, as your business logic can ensure that they can't exist without
>>>>>>>> an Account.  You could also have an offline cleanup process that swept away
>>>>>>>> orphans.
>>>>>>>>
>>>>>>>>  3) Try to model your updates as idempotent column inserts
>>>>>>>> instead.  How do you model updates as inserts?  Instead of munging the
>>>>>>>> value directly, you could insert a column containing the operation you want
>>>>>>>> to perform (like "+5").  It would work kind of like the Consistent Vote
>>>>>>>> Counting implementation: ( https://gist.github.com/416666 ).  How
>>>>>>>> do you make the inserts idempotent?  Make sure the column names correspond
>>>>>>>> to a request ID or some other identifier that would be identical across
>>>>>>>> re-drives of a given (perhaps originally failed) request.  This could leave
>>>>>>>> your datastore in a temporarily inconsistent state, but would eventually
>>>>>>>> become consistent after a successful re-drive of the original request.
>>>>>>>>
>>>>>>>>  4) You could take an approach like Dominic Williams proposed with
>>>>>>>> Cages:
>>>>>>>> http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/   The gist is that you snapshot all the original values that you're about
>>>>>>>> to munge somewhere else (in his case, ZooKeeper), make your updates, and
>>>>>>>> then delete the snapshot (and that delete needs to be atomic).  If the
>>>>>>>> snapshot data was never deleted, then subsequent accessors (even readers)
>>>>>>>> of the data rows need to do the rollback of the previous transaction
>>>>>>>> themselves before they can read/write this data.  They do the rollback by
>>>>>>>> just overwriting the current values with what is in the snapshot.  It
>>>>>>>> offloads the work of the rollback to the next worker that accesses the
>>>>>>>> data.  This approach probably needs an generic/high-level programming layer
>>>>>>>> to handle all of the details and complexity, and it doesn't seem like it
>>>>>>>> was ever added to Cages.
>>>>>>>>
>>>>>>>>
>>>>>>>>  Are there other approaches or best practices that I missed?  I
>>>>>>>> would be very interested in hearing any opinions from those who have
>>>>>>>> tackled these problems before.
>>>>>>>>
>>>>>>>>  Thanks!
>>>>>>>>  John
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   --
>>>>>>> Jérémy
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: best practices for simulating transactions in Cassandra

Posted by Boris Yen <yu...@gmail.com>.
I am not sure if this is the right thread to ask about this.

I read that some people are using Cages + ZooKeeper. I was wondering if
anyone has evaluated https://github.com/Netflix/curator? It seems to be a
versatile package.

On Tue, Dec 13, 2011 at 6:06 AM, John Laban <jo...@pagerduty.com> wrote:

> Ok, great.  I'll be sure to look into the virtualization-specific NTP
> guides.
>
> Another benefit of using Cassandra over Zookeeper for locking is that you
> don't have to worry about losing your connection to Zookeeper (and with it
> your locks) while hammering away at data in Cassandra.  If using Cassandra
> for locks, if you lose your locks you lose your connection to the datastore
> too.   (We're using long-ish session timeouts + connection listeners in ZK
> to mitigate that now.)
>
> John
>
>
>
> On Mon, Dec 12, 2011 at 12:55 PM, Dominic Williams <
> dwilliams@fightmymonster.com> wrote:
>
>> Hi John,
>>
>> On 12 December 2011 19:35, John Laban <jo...@pagerduty.com> wrote:
>>>
>>> So I responded to your algorithm in another part of this thread (very
>>> interesting) but this part of the paper caught my attention:
>>>
>>> > When client application code releases a lock, that lock must not
>>> actually be
>>> > released for a period equal to one millisecond plus twice the maximum
>>> possible
>>> > drift of the clocks in the client computers accessing the Cassandra
>>> databases
>>>
>>> I've been worried about this, and added some arbitrary delay in the
>>> releasing of my locks.  But I don't like it as it's (A) an arbitrary value
>>> and (B) it will - perhaps greatly - reduce the throughput of the more
>>> high-contention areas of my system.
>>>
>>> To fix (B) I'll probably just have to try to get rid of locks all
>>> together in these high-contention areas.
>>>
>>> To fix (A), I'd need to know what the maximum possible drift of my
>>> clocks will be.  How did you determine this?  What value do you use, out of
>>> curiosity?  What does the network layout of your client machines look like?
>>>  (Are any of your hosts geographically separated or all running in the same
>>> DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
>>> clock skew on an ongoing basis?  Am I worrying too much?
>>>
>>
>> If you setup NTP carefully no machine should drift more than 4ms say. I
>> forget where, but you'll find the best documentation on how to make a
>> bullet-proof NTP setup on vendor sites for virtualization software (because
>> virtualization software can cause drift so NTP setup has to be just so)
>>
>> What this means is that, for example, to be really safe when a thread
>> releases a lock you should wait say 9ms. Some points:-
>> -- since the sleep is performed before release, an isolated operation
>> should not be delayed at all
>> -- only a waiting thread or a thread requesting a lock immediately it is
>> released will be delayed, and no extra CPU or memory load is involved
>> -- in practice for the vast majority of "application layer" data
>> operations this restriction will have no effect on overall performance as
>> experienced by a user, because such operations nearly always read and write
>> to data with limited scope, for example the data of two users involved in
>> some transaction
>> -- the clocks issue does mean that you can't really serialize access to
>> more broadly shared data where more than 5 or 10 such requests are made a
>> second, say, but in reality even if the extra 9ms sleep on release wasn't
>> necessary, variability in database operation execution time (say under
>> load, or when something goes wrong) means trouble might occur serializing
>> with that level of contention
>>
>> So in summary, although this drift thing seems bad at first, partly
>> because it is a new consideration, in practice it's no big deal so long as
>> you look after your clocks (and the main issue to watch out for is when
>> application nodes running on virtualization software, hypervisors et al
>> have setup issues that make their clocks drift under load, and it is a good
>> idea to be wary of that)
>>
>> Best, Dominic
>>
>>
>>> Sorry for all the questions but I'm very concerned about this particular
>>> problem :)
>>>
>>> Thanks,
>>> John
>>>
>>>
>>> On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <
>>> dwilliams@fightmymonster.com> wrote:
>>>
>>>> Hi guys, just thought I'd chip in...
>>>>
>>>> Fight My Monster is still using Cages, which is working fine, but...
>>>>
>>>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are
>>>> 2 main reasons:-
>>>>
>>>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we
>>>> aren't getting anywhere near to capacity and we do a *lot*
>>>> of serialisation) at some point it will be necessary to start hashing lock
>>>> paths onto separate ZooKeeper clusters, and I tend to believe that these
>>>> days you should choose platforms that handle sharding themselves (e.g.
>>>> choose Cassandra rather than MySQL)
>>>>
>>>> 2. Why have more components in your system when you can have less!!!
>>>> KISS
>>>>
>>>> Recently I therefore tried to devise an algorithm which can be used to
>>>> add a distributed locking layer to clients such as Pelops, Hector, Pycassa
>>>> etc.
>>>>
>>>> There is a doc describing the algorithm, to which may be added an
>>>> appendix describing a protocol so that locking can be interoperable between
>>>> the clients. That could be extended to describe a protocol for
>>>> transactions. Word of warning this is a *beta* algorithm that has only been
>>>> seen by a select group so far, and therefore not even 100% sure it works
>>>> but there is a useful general discussion regarding serialization of
>>>> reads/writes so I include it anyway (and since this algorithm is going to
>>>> be out there now, if there's anyone out there who fancies doing a Z proof
>>>> or disproof, that would be fantastic).
>>>> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>>>>
>>>> Final word on this re transactions: if/when transactions are added to
>>>> locking system in Pelops/Hector/Pycassa, Cassandra will provide better
>>>> performance than ZooKeeper for storing snapshots, especially as transaction
>>>> size increases
>>>>
>>>> Best, Dominic
>>>>
>>>> On 11 December 2011 01:53, Guy Incognito <dn...@gmail.com> wrote:
>>>>
>>>>>  you could try writing with the clock of the initial replay entry?
>>>>>
>>>>> On 06/12/2011 20:26, John Laban wrote:
>>>>>
>>>>> Ah, neat.  It is similar to what was proposed in (4) above with adding
>>>>> transactions to Cages, but instead of snapshotting the data to be rolled
>>>>> back (the "before" data), you snapshot the data to be replayed (the "after"
>>>>> data).  And then later, if you find that the transaction didn't complete,
>>>>> you just keep replaying the transaction until it takes.
>>>>>
>>>>>  The part I don't understand with this approach though:  how do you
>>>>> ensure that someone else didn't change the data between your initial failed
>>>>> transaction and the later replaying of the transaction?  You could get lost
>>>>> writes in that situation.
>>>>>
>>>>>  Dominic (in the Cages blog post) explained a workaround with that
>>>>> for his rollback proposal:  all subsequent readers or writers of that data
>>>>> would have to check for abandoned transactions and roll them back
>>>>> themselves before they could read the data.  I don't think this is possible
>>>>> with the XACT_LOG "replay" approach in these slides though, based on how
>>>>> the data is indexed (cassandra node token + timeUUID).
>>>>>
>>>>>
>>>>>  PS:  How are you liking Cages?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2011/12/6 Jérémy SEVELLEC <js...@gmail.com>
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>>  I had exactly the same reflexions.
>>>>>>
>>>>>>  I'm using zookeeper and cage to lock et isolate.
>>>>>>
>>>>>>  but how to rollback?
>>>>>> It's impossible so try replay!
>>>>>>
>>>>>>  the idea is explained in this presentation
>>>>>> http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
>>>>>> from slide 24)
>>>>>>
>>>>>>  - insert your whole data into one column
>>>>>> - make the job
>>>>>> - remove (or expire) your column.
>>>>>>
>>>>>>  if there is a problem during "making the job", you keep the
>>>>>> possibility to replay and replay and replay (synchronously or in a batch).
>>>>>>
>>>>>>  Regards
>>>>>>
>>>>>>  Jérémy
>>>>>>
>>>>>>
>>>>>> 2011/12/5 John Laban <jo...@pagerduty.com>
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>>  I'm building a system using Cassandra as a datastore and I have a
>>>>>>> few places where I am need of transactions.
>>>>>>>
>>>>>>>  I'm using ZooKeeper to provide locking when I'm in need of some
>>>>>>> concurrency control or isolation, so that solves that half of the puzzle.
>>>>>>>
>>>>>>>  What I need now is to sometimes be able to get atomicity across
>>>>>>> multiple writes by simulating the "begin/rollback/commit" abilities of a
>>>>>>> relational DB.  In other words, there are places where I need to perform
>>>>>>> multiple updates/inserts, and if I fail partway through, I would ideally be
>>>>>>> able to rollback the partially-applied updates.
>>>>>>>
>>>>>>>  Now, I *know* this isn't possible with Cassandra.  What I'm
>>>>>>> looking for are all the best practices, or at least tips and tricks, so
>>>>>>> that I can get around this limitation in Cassandra and still maintain a
>>>>>>> consistent datastore.  (I am using quorum reads/writes so that eventual
>>>>>>> consistency doesn't kick my ass here as well.)
>>>>>>>
>>>>>>>  Below are some ideas I've been able to dig up.  Please let me know
>>>>>>> if any of them don't make sense, or if there are better approaches:
>>>>>>>
>>>>>>>
>>>>>>>  1) Updates to a row in a column family are atomic.  So try to
>>>>>>> model your data so that you would only ever need to update a single row in
>>>>>>> a single CF at once.  Essentially, you model your data around transactions.
>>>>>>>  This is tricky but can certainly be done in some situations.
>>>>>>>
>>>>>>>  2) If you are only dealing with multiple row *inserts* (and not
>>>>>>> updates), have one of the rows act as a 'commit' by essentially validating
>>>>>>> the presence of the other rows.  For example, say you were performing an
>>>>>>> operation where you wanted to create an Account row and 5 User rows all at
>>>>>>> once (this is an unlikely example, but bear with me).  You could insert 5
>>>>>>> rows into the Users CF, and then the 1 row into the Accounts CF, which acts
>>>>>>> as the commit.  If something went wrong before the Account could be
>>>>>>> created, any Users that had been created so far would be orphaned and
>>>>>>> unusable, as your business logic can ensure that they can't exist without
>>>>>>> an Account.  You could also have an offline cleanup process that swept away
>>>>>>> orphans.
>>>>>>>
>>>>>>>  3) Try to model your updates as idempotent column inserts instead.
>>>>>>>  How do you model updates as inserts?  Instead of munging the value
>>>>>>> directly, you could insert a column containing the operation you want to
>>>>>>> perform (like "+5").  It would work kind of like the Consistent Vote
>>>>>>> Counting implementation: ( https://gist.github.com/416666 ).  How
>>>>>>> do you make the inserts idempotent?  Make sure the column names correspond
>>>>>>> to a request ID or some other identifier that would be identical across
>>>>>>> re-drives of a given (perhaps originally failed) request.  This could leave
>>>>>>> your datastore in a temporarily inconsistent state, but would eventually
>>>>>>> become consistent after a successful re-drive of the original request.
>>>>>>>
>>>>>>>  4) You could take an approach like Dominic Williams proposed with
>>>>>>> Cages:
>>>>>>> http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/   The gist is that you snapshot all the original values that you're about
>>>>>>> to munge somewhere else (in his case, ZooKeeper), make your updates, and
>>>>>>> then delete the snapshot (and that delete needs to be atomic).  If the
>>>>>>> snapshot data was never deleted, then subsequent accessors (even readers)
>>>>>>> of the data rows need to do the rollback of the previous transaction
>>>>>>> themselves before they can read/write this data.  They do the rollback by
>>>>>>> just overwriting the current values with what is in the snapshot.  It
>>>>>>> offloads the work of the rollback to the next worker that accesses the
>>>>>>> data.  This approach probably needs an generic/high-level programming layer
>>>>>>> to handle all of the details and complexity, and it doesn't seem like it
>>>>>>> was ever added to Cages.
>>>>>>>
>>>>>>>
>>>>>>>  Are there other approaches or best practices that I missed?  I
>>>>>>> would be very interested in hearing any opinions from those who have
>>>>>>> tackled these problems before.
>>>>>>>
>>>>>>>  Thanks!
>>>>>>>  John
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>   --
>>>>>> Jérémy
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: best practices for simulating transactions in Cassandra

Posted by John Laban <jo...@pagerduty.com>.
Ok, great.  I'll be sure to look into the virtualization-specific NTP
guides.

Another benefit of using Cassandra over Zookeeper for locking is that you
don't have to worry about losing your connection to Zookeeper (and with it
your locks) while hammering away at data in Cassandra.  If you use Cassandra
for locks and you lose your locks, you've lost your connection to the
datastore too.  (We're using long-ish session timeouts + connection
listeners in ZK
to mitigate that now.)
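
For illustration, that mitigation looks something like the following with a
Python ZooKeeper client such as kazoo (not our actual code, just the shape
of it):

    from kazoo.client import KazooClient, KazooState

    # A longer session timeout means a brief network blip suspends the
    # session rather than expiring it (and taking the locks with it).
    zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181', timeout=30.0)

    def on_zk_state(state):
        if state == KazooState.SUSPENDED:
            print('ZK connection flaky; pause work that depends on held locks')
        elif state == KazooState.LOST:
            print('ZK session expired; locks are gone, abort or retry safely')

    zk.add_listener(on_zk_state)
    zk.start()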

John



On Mon, Dec 12, 2011 at 12:55 PM, Dominic Williams <
dwilliams@fightmymonster.com> wrote:

> Hi John,
>
> On 12 December 2011 19:35, John Laban <jo...@pagerduty.com> wrote:
>>
>> So I responded to your algorithm in another part of this thread (very
>> interesting) but this part of the paper caught my attention:
>>
>> > When client application code releases a lock, that lock must not
>> actually be
>> > released for a period equal to one millisecond plus twice the maximum
>> possible
>> > drift of the clocks in the client computers accessing the Cassandra
>> databases
>>
>> I've been worried about this, and added some arbitrary delay in the
>> releasing of my locks.  But I don't like it as it's (A) an arbitrary value
>> and (B) it will - perhaps greatly - reduce the throughput of the more
>> high-contention areas of my system.
>>
>> To fix (B) I'll probably just have to try to get rid of locks all
>> together in these high-contention areas.
>>
>> To fix (A), I'd need to know what the maximum possible drift of my clocks
>> will be.  How did you determine this?  What value do you use, out of
>> curiosity?  What does the network layout of your client machines look like?
>>  (Are any of your hosts geographically separated or all running in the same
>> DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
>> clock skew on an ongoing basis?  Am I worrying too much?
>>
>
> If you setup NTP carefully no machine should drift more than 4ms say. I
> forget where, but you'll find the best documentation on how to make a
> bullet-proof NTP setup on vendor sites for virtualization software (because
> virtualization software can cause drift so NTP setup has to be just so)
>
> What this means is that, for example, to be really safe when a thread
> releases a lock you should wait say 9ms. Some points:-
> -- since the sleep is performed before release, an isolated operation
> should not be delayed at all
> -- only a waiting thread or a thread requesting a lock immediately it is
> released will be delayed, and no extra CPU or memory load is involved
> -- in practice for the vast majority of "application layer" data
> operations this restriction will have no effect on overall performance as
> experienced by a user, because such operations nearly always read and write
> to data with limited scope, for example the data of two users involved in
> some transaction
> -- the clocks issue does mean that you can't really serialize access to
> more broadly shared data where more than 5 or 10 such requests are made a
> second, say, but in reality even if the extra 9ms sleep on release wasn't
> necessary, variability in database operation execution time (say under
> load, or when something goes wrong) means trouble might occur serializing
> with that level of contention
>
> So in summary, although this drift thing seems bad at first, partly
> because it is a new consideration, in practice it's no big deal so long as
> you look after your clocks (and the main issue to watch out for is when
> application nodes running on virtualization software, hypervisors et al
> have setup issues that make their clocks drift under load, and it is a good
> idea to be wary of that)
>
> Best, Dominic
>
>
>> Sorry for all the questions but I'm very concerned about this particular
>> problem :)
>>
>> Thanks,
>> John
>>
>>
>> On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <
>> dwilliams@fightmymonster.com> wrote:
>>
>>> Hi guys, just thought I'd chip in...
>>>
>>> Fight My Monster is still using Cages, which is working fine, but...
>>>
>>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are 2
>>> main reasons:-
>>>
>>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we aren't
>>> getting anywhere near to capacity and we do a *lot* of serialisation) at
>>> some point it will be necessary to start hashing lock paths onto separate
>>> ZooKeeper clusters, and I tend to believe that these days you should choose
>>> platforms that handle sharding themselves (e.g. choose Cassandra rather
>>> than MySQL)
>>>
>>> 2. Why have more components in your system when you can have less!!! KISS
>>>
>>> Recently I therefore tried to devise an algorithm which can be used to
>>> add a distributed locking layer to clients such as Pelops, Hector, Pycassa
>>> etc.
>>>
>>> There is a doc describing the algorithm, to which may be added an
>>> appendix describing a protocol so that locking can be interoperable between
>>> the clients. That could be extended to describe a protocol for
>>> transactions. Word of warning this is a *beta* algorithm that has only been
>>> seen by a select group so far, and therefore not even 100% sure it works
>>> but there is a useful general discussion regarding serialization of
>>> reads/writes so I include it anyway (and since this algorithm is going to
>>> be out there now, if there's anyone out there who fancies doing a Z proof
>>> or disproof, that would be fantastic).
>>> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>>>
>>> Final word on this re transactions: if/when transactions are added to
>>> locking system in Pelops/Hector/Pycassa, Cassandra will provide better
>>> performance than ZooKeeper for storing snapshots, especially as transaction
>>> size increases
>>>
>>> Best, Dominic
>>>
>>> On 11 December 2011 01:53, Guy Incognito <dn...@gmail.com> wrote:
>>>
>>>>  you could try writing with the clock of the initial replay entry?
>>>>
>>>> On 06/12/2011 20:26, John Laban wrote:
>>>>
>>>> Ah, neat.  It is similar to what was proposed in (4) above with adding
>>>> transactions to Cages, but instead of snapshotting the data to be rolled
>>>> back (the "before" data), you snapshot the data to be replayed (the "after"
>>>> data).  And then later, if you find that the transaction didn't complete,
>>>> you just keep replaying the transaction until it takes.
>>>>
>>>>  The part I don't understand with this approach though:  how do you
>>>> ensure that someone else didn't change the data between your initial failed
>>>> transaction and the later replaying of the transaction?  You could get lost
>>>> writes in that situation.
>>>>
>>>>  Dominic (in the Cages blog post) explained a workaround with that for
>>>> his rollback proposal:  all subsequent readers or writers of that data
>>>> would have to check for abandoned transactions and roll them back
>>>> themselves before they could read the data.  I don't think this is possible
>>>> with the XACT_LOG "replay" approach in these slides though, based on how
>>>> the data is indexed (cassandra node token + timeUUID).
>>>>
>>>>
>>>>  PS:  How are you liking Cages?
>>>>
>>>>
>>>>
>>>>
>>>> 2011/12/6 Jérémy SEVELLEC <js...@gmail.com>
>>>>
>>>>> Hi John,
>>>>>
>>>>>  I had exactly the same reflexions.
>>>>>
>>>>>  I'm using zookeeper and cage to lock et isolate.
>>>>>
>>>>>  but how to rollback?
>>>>> It's impossible so try replay!
>>>>>
>>>>>  the idea is explained in this presentation
>>>>> http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
>>>>> from slide 24)
>>>>>
>>>>>  - insert your whole data into one column
>>>>> - make the job
>>>>> - remove (or expire) your column.
>>>>>
>>>>>  if there is a problem during "making the job", you keep the
>>>>> possibility to replay and replay and replay (synchronously or in a batch).
>>>>>
>>>>>  Regards
>>>>>
>>>>>  Jérémy
>>>>>
>>>>>
>>>>> 2011/12/5 John Laban <jo...@pagerduty.com>
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>>  I'm building a system using Cassandra as a datastore and I have a
>>>>>> few places where I am need of transactions.
>>>>>>
>>>>>>  I'm using ZooKeeper to provide locking when I'm in need of some
>>>>>> concurrency control or isolation, so that solves that half of the puzzle.
>>>>>>
>>>>>>  What I need now is to sometimes be able to get atomicity across
>>>>>> multiple writes by simulating the "begin/rollback/commit" abilities of a
>>>>>> relational DB.  In other words, there are places where I need to perform
>>>>>> multiple updates/inserts, and if I fail partway through, I would ideally be
>>>>>> able to rollback the partially-applied updates.
>>>>>>
>>>>>>  Now, I *know* this isn't possible with Cassandra.  What I'm looking
>>>>>> for are all the best practices, or at least tips and tricks, so that I can
>>>>>> get around this limitation in Cassandra and still maintain a consistent
>>>>>> datastore.  (I am using quorum reads/writes so that eventual consistency
>>>>>> doesn't kick my ass here as well.)
>>>>>>
>>>>>>  Below are some ideas I've been able to dig up.  Please let me know
>>>>>> if any of them don't make sense, or if there are better approaches:
>>>>>>
>>>>>>
>>>>>>  1) Updates to a row in a column family are atomic.  So try to model
>>>>>> your data so that you would only ever need to update a single row in a
>>>>>> single CF at once.  Essentially, you model your data around transactions.
>>>>>>  This is tricky but can certainly be done in some situations.
>>>>>>
>>>>>>  2) If you are only dealing with multiple row *inserts* (and not
>>>>>> updates), have one of the rows act as a 'commit' by essentially validating
>>>>>> the presence of the other rows.  For example, say you were performing an
>>>>>> operation where you wanted to create an Account row and 5 User rows all at
>>>>>> once (this is an unlikely example, but bear with me).  You could insert 5
>>>>>> rows into the Users CF, and then the 1 row into the Accounts CF, which acts
>>>>>> as the commit.  If something went wrong before the Account could be
>>>>>> created, any Users that had been created so far would be orphaned and
>>>>>> unusable, as your business logic can ensure that they can't exist without
>>>>>> an Account.  You could also have an offline cleanup process that swept away
>>>>>> orphans.
>>>>>>
>>>>>>  3) Try to model your updates as idempotent column inserts instead.
>>>>>>  How do you model updates as inserts?  Instead of munging the value
>>>>>> directly, you could insert a column containing the operation you want to
>>>>>> perform (like "+5").  It would work kind of like the Consistent Vote
>>>>>> Counting implementation: ( https://gist.github.com/416666 ).  How do
>>>>>> you make the inserts idempotent?  Make sure the column names correspond to
>>>>>> a request ID or some other identifier that would be identical across
>>>>>> re-drives of a given (perhaps originally failed) request.  This could leave
>>>>>> your datastore in a temporarily inconsistent state, but would eventually
>>>>>> become consistent after a successful re-drive of the original request.
>>>>>>
>>>>>>  4) You could take an approach like Dominic Williams proposed with
>>>>>> Cages:
>>>>>> http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/   The gist is that you snapshot all the original values that you're about
>>>>>> to munge somewhere else (in his case, ZooKeeper), make your updates, and
>>>>>> then delete the snapshot (and that delete needs to be atomic).  If the
>>>>>> snapshot data was never deleted, then subsequent accessors (even readers)
>>>>>> of the data rows need to do the rollback of the previous transaction
>>>>>> themselves before they can read/write this data.  They do the rollback by
>>>>>> just overwriting the current values with what is in the snapshot.  It
>>>>>> offloads the work of the rollback to the next worker that accesses the
>>>>>> data.  This approach probably needs an generic/high-level programming layer
>>>>>> to handle all of the details and complexity, and it doesn't seem like it
>>>>>> was ever added to Cages.
>>>>>>
>>>>>>
>>>>>>  Are there other approaches or best practices that I missed?  I
>>>>>> would be very interested in hearing any opinions from those who have
>>>>>> tackled these problems before.
>>>>>>
>>>>>>  Thanks!
>>>>>>  John
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>   --
>>>>> Jérémy
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: best practices for simulating transactions in Cassandra

Posted by Dominic Williams <dw...@fightmymonster.com>.
Hi John,

On 12 December 2011 19:35, John Laban <jo...@pagerduty.com> wrote:
>
> So I responded to your algorithm in another part of this thread (very
> interesting) but this part of the paper caught my attention:
>
> > When client application code releases a lock, that lock must not
> actually be
> > released for a period equal to one millisecond plus twice the maximum
> possible
> > drift of the clocks in the client computers accessing the Cassandra
> databases
>
> I've been worried about this, and added some arbitrary delay in the
> releasing of my locks.  But I don't like it as it's (A) an arbitrary value
> and (B) it will - perhaps greatly - reduce the throughput of the more
> high-contention areas of my system.
>
> To fix (B) I'll probably just have to try to get rid of locks all together
> in these high-contention areas.
>
> To fix (A), I'd need to know what the maximum possible drift of my clocks
> will be.  How did you determine this?  What value do you use, out of
> curiosity?  What does the network layout of your client machines look like?
>  (Are any of your hosts geographically separated or all running in the same
> DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
> clock skew on an ongoing basis?  Am I worrying too much?
>

If you set up NTP carefully, no machine should drift by more than, say, 4ms.
I forget exactly where, but you'll find the best documentation on building a
bullet-proof NTP setup on the vendor sites for virtualization software
(virtualization itself can cause drift, so the NTP setup has to be just right).

What this means is that, for example, to be really safe a thread should
wait, say, 9ms before it actually releases a lock. Some points:
-- since the sleep is performed before release, an isolated operation
should not be delayed at all
-- only a waiting thread, or a thread requesting the lock immediately after
it is released, will be delayed, and no extra CPU or memory load is involved
-- in practice, for the vast majority of "application layer" data operations
this restriction will have no effect on overall performance as experienced
by a user, because such operations nearly always read and write data with
limited scope, for example the data of the two users involved in some
transaction
-- the clocks issue does mean that you can't really serialize access to more
broadly shared data where more than 5 or 10 such requests are made per
second, say; but in reality, even if the extra 9ms sleep on release weren't
necessary, variability in database operation execution time (say under load,
or when something goes wrong) means trouble would likely occur serializing
at that level of contention
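
To make the release rule concrete, a minimal sketch (plain Python,
untested; the numbers are just the examples above):

    import time

    MAX_CLOCK_DRIFT_MS = 4.0   # what a careful NTP setup should hold

    def release_with_drift_guard(lock):
        # Sleep for 1ms + twice the worst-case drift *before* releasing, so
        # the next holder's timestamps cannot land behind ours.
        time.sleep((1.0 + 2 * MAX_CLOCK_DRIFT_MS) / 1000.0)
        lock.release()         # any lock object with a release() method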

So in summary, although this drift thing seems bad at first, partly because
it is a new consideration, in practice it's no big deal so long as you look
after your clocks. The main issue to watch out for is application nodes
running on virtualization software, hypervisors et al, whose setup issues
can make their clocks drift under load; it is a good idea to be wary of
that.

Best, Dominic


> Sorry for all the questions but I'm very concerned about this particular
> problem :)
>
> Thanks,
> John
>
>
> On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <
> dwilliams@fightmymonster.com> wrote:
>
>> Hi guys, just thought I'd chip in...
>>
>> Fight My Monster is still using Cages, which is working fine, but...
>>
>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are 2
>> main reasons:-
>>
>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we aren't
>> getting anywhere near to capacity and we do a *lot* of serialisation) at
>> some point it will be necessary to start hashing lock paths onto separate
>> ZooKeeper clusters, and I tend to believe that these days you should choose
>> platforms that handle sharding themselves (e.g. choose Cassandra rather
>> than MySQL)
>>
>> 2. Why have more components in your system when you can have less!!! KISS
>>
>> Recently I therefore tried to devise an algorithm which can be used to
>> add a distributed locking layer to clients such as Pelops, Hector, Pycassa
>> etc.
>>
>> There is a doc describing the algorithm, to which may be added an
>> appendix describing a protocol so that locking can be interoperable between
>> the clients. That could be extended to describe a protocol for
>> transactions. Word of warning this is a *beta* algorithm that has only been
>> seen by a select group so far, and therefore not even 100% sure it works
>> but there is a useful general discussion regarding serialization of
>> reads/writes so I include it anyway (and since this algorithm is going to
>> be out there now, if there's anyone out there who fancies doing a Z proof
>> or disproof, that would be fantastic).
>> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>>
>> Final word on this re transactions: if/when transactions are added to
>> locking system in Pelops/Hector/Pycassa, Cassandra will provide better
>> performance than ZooKeeper for storing snapshots, especially as transaction
>> size increases
>>
>> Best, Dominic
>>
>> On 11 December 2011 01:53, Guy Incognito <dn...@gmail.com> wrote:
>>
>>>  you could try writing with the clock of the initial replay entry?
>>>
>>> On 06/12/2011 20:26, John Laban wrote:
>>>
>>> Ah, neat.  It is similar to what was proposed in (4) above with adding
>>> transactions to Cages, but instead of snapshotting the data to be rolled
>>> back (the "before" data), you snapshot the data to be replayed (the "after"
>>> data).  And then later, if you find that the transaction didn't complete,
>>> you just keep replaying the transaction until it takes.
>>>
>>>  The part I don't understand with this approach though:  how do you
>>> ensure that someone else didn't change the data between your initial failed
>>> transaction and the later replaying of the transaction?  You could get lost
>>> writes in that situation.
>>>
>>>  Dominic (in the Cages blog post) explained a workaround with that for
>>> his rollback proposal:  all subsequent readers or writers of that data
>>> would have to check for abandoned transactions and roll them back
>>> themselves before they could read the data.  I don't think this is possible
>>> with the XACT_LOG "replay" approach in these slides though, based on how
>>> the data is indexed (cassandra node token + timeUUID).
>>>
>>>
>>>  PS:  How are you liking Cages?
>>>
>>>
>>>
>>>
>>> 2011/12/6 Jérémy SEVELLEC <js...@gmail.com>
>>>
>>>> Hi John,
>>>>
>>>>  I had exactly the same reflexions.
>>>>
>>>>  I'm using zookeeper and cage to lock et isolate.
>>>>
>>>>  but how to rollback?
>>>> It's impossible so try replay!
>>>>
>>>>  the idea is explained in this presentation
>>>> http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
>>>> from slide 24)
>>>>
>>>>  - insert your whole data into one column
>>>> - make the job
>>>> - remove (or expire) your column.
>>>>
>>>>  if there is a problem during "making the job", you keep the
>>>> possibility to replay and replay and replay (synchronously or in a batch).
>>>>
>>>>  Regards
>>>>
>>>>  Jérémy
>>>>
>>>>
>>>> 2011/12/5 John Laban <jo...@pagerduty.com>
>>>>
>>>>> Hello,
>>>>>
>>>>>  I'm building a system using Cassandra as a datastore and I have a
>>>>> few places where I am need of transactions.
>>>>>
>>>>>  I'm using ZooKeeper to provide locking when I'm in need of some
>>>>> concurrency control or isolation, so that solves that half of the puzzle.
>>>>>
>>>>>  What I need now is to sometimes be able to get atomicity across
>>>>> multiple writes by simulating the "begin/rollback/commit" abilities of a
>>>>> relational DB.  In other words, there are places where I need to perform
>>>>> multiple updates/inserts, and if I fail partway through, I would ideally be
>>>>> able to rollback the partially-applied updates.
>>>>>
>>>>>  Now, I *know* this isn't possible with Cassandra.  What I'm looking
>>>>> for are all the best practices, or at least tips and tricks, so that I can
>>>>> get around this limitation in Cassandra and still maintain a consistent
>>>>> datastore.  (I am using quorum reads/writes so that eventual consistency
>>>>> doesn't kick my ass here as well.)
>>>>>
>>>>>  Below are some ideas I've been able to dig up.  Please let me know
>>>>> if any of them don't make sense, or if there are better approaches:
>>>>>
>>>>>
>>>>>  1) Updates to a row in a column family are atomic.  So try to model
>>>>> your data so that you would only ever need to update a single row in a
>>>>> single CF at once.  Essentially, you model your data around transactions.
>>>>>  This is tricky but can certainly be done in some situations.
>>>>>
>>>>>  2) If you are only dealing with multiple row *inserts* (and not
>>>>> updates), have one of the rows act as a 'commit' by essentially validating
>>>>> the presence of the other rows.  For example, say you were performing an
>>>>> operation where you wanted to create an Account row and 5 User rows all at
>>>>> once (this is an unlikely example, but bear with me).  You could insert 5
>>>>> rows into the Users CF, and then the 1 row into the Accounts CF, which acts
>>>>> as the commit.  If something went wrong before the Account could be
>>>>> created, any Users that had been created so far would be orphaned and
>>>>> unusable, as your business logic can ensure that they can't exist without
>>>>> an Account.  You could also have an offline cleanup process that swept away
>>>>> orphans.
>>>>>
>>>>>  3) Try to model your updates as idempotent column inserts instead.
>>>>>  How do you model updates as inserts?  Instead of munging the value
>>>>> directly, you could insert a column containing the operation you want to
>>>>> perform (like "+5").  It would work kind of like the Consistent Vote
>>>>> Counting implementation: ( https://gist.github.com/416666 ).  How do
>>>>> you make the inserts idempotent?  Make sure the column names correspond to
>>>>> a request ID or some other identifier that would be identical across
>>>>> re-drives of a given (perhaps originally failed) request.  This could leave
>>>>> your datastore in a temporarily inconsistent state, but would eventually
>>>>> become consistent after a successful re-drive of the original request.
>>>>>
>>>>>  4) You could take an approach like Dominic Williams proposed with
>>>>> Cages:
>>>>> http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/   The gist is that you snapshot all the original values that you're about
>>>>> to munge somewhere else (in his case, ZooKeeper), make your updates, and
>>>>> then delete the snapshot (and that delete needs to be atomic).  If the
>>>>> snapshot data was never deleted, then subsequent accessors (even readers)
>>>>> of the data rows need to do the rollback of the previous transaction
>>>>> themselves before they can read/write this data.  They do the rollback by
>>>>> just overwriting the current values with what is in the snapshot.  It
>>>>> offloads the work of the rollback to the next worker that accesses the
>>>>> data.  This approach probably needs an generic/high-level programming layer
>>>>> to handle all of the details and complexity, and it doesn't seem like it
>>>>> was ever added to Cages.
>>>>>
>>>>>
>>>>>  Are there other approaches or best practices that I missed?  I would
>>>>> be very interested in hearing any opinions from those who have tackled
>>>>> these problems before.
>>>>>
>>>>>  Thanks!
>>>>>  John
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>   --
>>>> Jérémy
>>>>
>>>
>>>
>>>
>>
>

Re: best practices for simulating transactions in Cassandra

Posted by John Laban <jo...@pagerduty.com>.
Hi Dominic,

So I responded to your algorithm in another part of this thread (very
interesting) but this part of the paper caught my attention:

> When client application code releases a lock, that lock must not actually
be
> released for a period equal to one millisecond plus twice the maximum
possible
> drift of the clocks in the client computers accessing the Cassandra
databases

I've been worried about this, and added some arbitrary delay in the
releasing of my locks.  But I don't like it, as it's (A) an arbitrary value
and (B) something that will, perhaps greatly, reduce the throughput of the
more high-contention areas of my system.

To fix (B) I'll probably just have to try to get rid of locks altogether
in these high-contention areas.

To fix (A), I'd need to know what the maximum possible drift of my clocks
will be.  How did you determine this?  What value do you use, out of
curiosity?  What does the network layout of your client machines look like?
 (Are any of your hosts geographically separated or all running in the same
DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
clock skew on an ongoing basis?  Am I worrying too much?

Sorry for all the questions but I'm very concerned about this particular
problem :)

Thanks,
John



Re: best practices for simulating transactions in Cassandra

Posted by John Laban <jo...@pagerduty.com>.
Hey Jake,

So I guess my problem is that I've never really relied on NTP before to try
to guarantee consistency in my application.  Does it tend to work really
well in practice?  What's the maximum clock skew you can see even when
running NTP (especially if you're using more than one DC where you might
have some latency between clients)?

Reading your code, it looks like you sleep 100ms in between the column
insert and the subsequent read in order to mitigate clock skew.  That might
be enough to fix the problem there, and I think the algorithm makes sense
under the assumption that there won't be more than 100ms clock skew.

Is it possible for clients' clocks to deviate more than ~100ms when using
NTP?  Dominic mentioned elsewhere in this thread that clock skew might
often be an order of magnitude less, so 100ms seems safe.


> It inspects each column, that represents a different acquire attempt
> and compares those timestamps.  so if client A is skewed in the past
> but encounters a non-expiring column it knows the lock is taken.

If you *didn't* have that 100ms guard, there are ways in your algorithm for
two clients to both think they won the lock.  I've labelled the steps of
the interesting part (and sketched them in code right after the list):

   1) client writes to that row @ QUORUM a column name of its ID with a
TTL of N seconds
   2) client instantly reads back the entire row @ QUORUM
   3) if the client encounters a column that is non-expiring then the lock is
already acquired.
   4) if the client encounters a non-deleted but expiring column with a
timestamp < the one it wrote then it sleeps and tries again.
   5) if the client's own timestamp was the earliest then it has won the lock
and writes a non-expiring column of the same name to mark it as officially
locked.
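
(To make sure I'm reading those steps right, here they are as a rough sketch.
The CassandraClient/Column interfaces are just stand-ins I made up for this
email, not the real Thrift/Hector API your Solandra code uses, and I've
ignored the tie-break on equal timestamps.)

    import java.util.List;

    interface Column {
        String name();
        long timestamp();
        boolean isExpiring();
    }

    interface CassandraClient {
        // QUORUM write; ttlSeconds == 0 means a non-expiring column here
        void writeColumn(String row, String name, long ts, int ttlSeconds);
        // QUORUM read of the entire row
        List<Column> readRow(String row);
    }

    final class LockSketch {
        static boolean tryAcquire(String lockRow, String myId, CassandraClient c)
                throws InterruptedException {
            long myTs = System.currentTimeMillis();
            c.writeColumn(lockRow, myId, myTs, 30);        // 1) expiring "attempt" column
            Thread.sleep(100);                             // the 100ms clock-skew guard
            for (Column col : c.readRow(lockRow)) {        // 2) read the whole row back
                if (col.name().equals(myId)) continue;
                if (!col.isExpiring()) return false;       // 3) someone already holds the lock
                if (col.timestamp() <= myTs) return false; // 4) an earlier attempt wins; back off, retry
            }
            c.writeColumn(lockRow, myId, myTs, 0);         // 5) mark it as officially locked
            return true;
        }
    }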

Let's assume client A's clock is ahead of client B's:

A does step 1 and 2
B does step 1 and 2
A does steps 3, 4, and 5 (it has the only column it read back in, so it
assumes it wins)
B does steps 3, 4, and 5 (it has the earliest timestamp because of clock skew,
so it assumes it wins)

(Or did I make a mistake somewhere?)

Anyway, like I said, it should work if NTP keeps the clocks nice and tight.

John





Re: best practices for simulating transactions in Cassandra

Posted by Jake Luciani <ja...@gmail.com>.
>
>
> Jake:  The algorithm you've outlined is pretty similar to how Zookeeper
> clients implement locking.  The potential only issue that I see with it
> implemented in Cassandra is that it uses the timestamps of the inserted
> columns to determine the winner of the lock.  The column timestamps are
> generated by the clients (whose clocks can drift from each other), so its
> possible for a client (whose clock is skewed to some time in the near past)
> to accidentally "steal" a lock from another client who presently thinks
> that it is the winner of the lock.  At least it seems that way to me.
>
>
I don't see that.  If a client wants to abuse the system or doesn't run NTP
then it can grab all the locks, but each lock is guaranteed to be owned by
one person, since the client timestamps are used to pick a winner; see
points 4 and 5.

It inspects each column, which represents a different acquire attempt, and
compares those timestamps.  So if client A is skewed into the past but
encounters a non-expiring column it knows the lock is taken.

-Jake





-- 
http://twitter.com/tjake

Re: best practices for simulating transactions in Cassandra

Posted by John Laban <jo...@pagerduty.com>.
Dominic/Jake: very interesting.  This is getting more into fundamentals of
locking/isolation rather than transactions/atomicity, but it is still
relevant since I was going to use ZooKeeper for that stuff, and it would
certainly be nice to KISS and remove a component from my setup if I can do
without it.


Jake:  The algorithm you've outlined is pretty similar to how ZooKeeper
clients implement locking.  The only potential issue that I see with it
implemented in Cassandra is that it uses the timestamps of the inserted
columns to determine the winner of the lock.  The column timestamps are
generated by the clients (whose clocks can drift from each other), so it's
possible for a client (whose clock is skewed to some time in the near past)
to accidentally "steal" a lock from another client who presently thinks
that it is the winner of the lock.  At least it seems that way to me.

Dominic's algorithm tries to get around this problem by using the client's
more reliable ability to "see" (with quorum consistency) columns previously
inserted by other clients to determine the holder of the lock, and it gets
progressively more complicated as it deals with clients that might be doing
their sequential writes and reads in lockstep with each other, etc.


Dominic:  I'll have to re-read your paper a few times (while furrowing my
brow and scratching my head) before I can convince myself that the proposed
algorithm doesn't have the possibility of deadlock or livelock.  It does
seem that you have covered a lot of the bases though.

Thanks for sharing guys :)
John



Re: best practices for simulating transactions in Cassandra

Posted by Jake Luciani <ja...@gmail.com>.
I've written a locking mechanism for Solandra (I refer to it as a
reservation system) which basically allows you to acquire a lock.  This is
used to ensure a node is serving unique sequential IDs for Lucene.

It sounds a bit similar to Dominic's description but I'll explain how the
Solandra one works.

The code is at
https://github.com/tjake/Solandra/blob/solandra/src/lucandra/cluster/CassandraIndexManager.java#L714

The algorithm is basically:

   - each node has a unique id.
   - a lock name is a row key
   - client writes to that row @ QUORUM a column name of its ID with a TTL
of N seconds
   - client instantly reads back the entire row @ QUORUM
   - if client encounters a column that is non-expiring then the lock is
already acquired.
   - if client encounters a non-deleted but expiring column with a
timestamp < the one it wrote then it sleeps and tries again.
   - if the client's own timestamp was the earliest then it has won the lock and
writes a non-expiring column of the same name to mark it as officially
locked.
   - in the case of a tie (2 columns with the same ts), the uuids are sorted and
the lesser one wins.
   - once finished, the node with the lock deletes the column and frees the
lock.

This algorithm allows for deadlocks, because a client has a huge number of
locks to work with.  It would be fairly simple to use a TTL again to make
locks auto-expire after N seconds, which would make it more like Google
Chubby.
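
(Roughly what I mean is the sketch below.  The writeColumn call is a made-up
stand-in rather than the actual Solandra/Hector code, and the refresh loop is
just one possible way of keeping the lease alive while you still hold the
lock.)

    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch of the auto-expiring ("chubby-like") variant: instead of a
    // non-expiring "officially locked" marker, write the marker with a TTL
    // and re-write it periodically while the lock is still needed.  If the
    // holder dies, the marker expires after leaseSeconds and the lock frees
    // itself instead of deadlocking.
    final class LeaseSketch {
        interface Client {
            void writeColumn(String row, String name, long ts, int ttlSeconds);
        }

        static void holdLock(String lockRow, String myId, Client c,
                             int leaseSeconds, AtomicBoolean stillNeeded)
                throws InterruptedException {
            while (stillNeeded.get()) {
                c.writeColumn(lockRow, myId, System.currentTimeMillis(), leaseSeconds);
                Thread.sleep((leaseSeconds * 1000L) / 2);  // refresh at half the lease
            }
            // on a clean exit the marker could be deleted immediately rather
            // than waiting for the TTL (not shown)
        }
    }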

It also allows for bad clients to game the system but that's not something
that could be dealt with using authorization apis.

For legacy reasons the linked code uses super columns but a regular column
family will work just fine.

-Jake




-- 
http://twitter.com/tjake

Re: best practices for simulating transactions in Cassandra

Posted by Dominic Williams <dw...@fightmymonster.com>.
Hi guys, just thought I'd chip in...

Fight My Monster is still using Cages, which is working fine, but...

I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are 2
main reasons:

1. Although a fast ZooKeeper cluster can handle a lot of load (we aren't
getting anywhere near capacity, and we do a *lot* of serialisation), at
some point it will be necessary to start hashing lock paths onto separate
ZooKeeper clusters, and I tend to believe that these days you should choose
platforms that handle sharding themselves (e.g. choose Cassandra rather
than MySQL).

2. Why have more components in your system when you can have fewer!!! KISS

Recently I therefore tried to devise an algorithm which can be used to add
a distributed locking layer to clients such as Pelops, Hector, Pycassa etc.

There is a doc describing the algorithm, to which an appendix may be added
describing a protocol so that locking can be interoperable between the
clients. That could be extended to describe a protocol for transactions.
Word of warning: this is a *beta* algorithm that has only been seen by a
select group so far, so I'm not even 100% sure it works, but it contains a
useful general discussion regarding serialization of reads/writes, so I
include it anyway (and since the algorithm is going to be out there now, if
anyone fancies doing a Z proof or disproof, that would be fantastic).
http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf

Final word on this re transactions: if/when transactions are added to the
locking system in Pelops/Hector/Pycassa, Cassandra will provide better
performance than ZooKeeper for storing snapshots, especially as transaction
size increases.
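
To make that concrete, here is a minimal sketch of the rollback approach
with the snapshot kept in a Cassandra column family instead of ZooKeeper.
It uses pycassa purely as an example client; the keyspace, the "Data" and
"Snapshots" column families and the helper function are illustrative
assumptions, and it still assumes the caller holds a lock on the row while
it runs.

import json
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])  # assumed keyspace/host
data_cf = pycassa.ColumnFamily(pool, 'Data')                     # rows being mutated
snapshot_cf = pycassa.ColumnFamily(pool, 'Snapshots')            # hypothetical snapshot CF

def transactional_update(row_key, new_columns):
    # Assumes a lock on row_key is already held, and that column names and
    # values are plain strings.
    # 1) Snapshot the current values into one column of one row; a
    #    single-row write is applied atomically by Cassandra.
    try:
        before = data_cf.get(row_key, columns=list(new_columns))
    except pycassa.NotFoundException:
        before = {}
    snapshot_cf.insert(row_key, {'before': json.dumps(before)})

    # 2) Apply the real updates.
    data_cf.insert(row_key, new_columns)

    # 3) "Commit" by deleting the snapshot.  If we die before this line, the
    #    next accessor of row_key can restore 'before' and then delete the
    #    snapshot itself (the rollback-by-the-next-worker idea).
    snapshot_cf.remove(row_key)

Whether this actually beats ZooKeeper presumably comes down mostly to
snapshot size; here a large "before" image is just another row write.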

Best, Dominic


Re: best practices for simulating transactions in Cassandra

Posted by Guy Incognito <dn...@gmail.com>.
you could try writing with the clock of the initial replay entry?
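
The idea, presumably, being that Cassandra resolves conflicting column
versions by write timestamp, so a replayed write that reuses the timestamp
recorded with the original log entry can never clobber a newer, legitimate
write made after the failed attempt.  A rough sketch, using pycassa purely
as an example client (the "XactLog" column family, the microsecond
convention and the helper names are assumptions; string column names and
values assumed throughout):

import json
import time
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
data_cf = pycassa.ColumnFamily(pool, 'Data')
xact_log_cf = pycassa.ColumnFamily(pool, 'XactLog')   # hypothetical replay log

def log_and_apply(txn_id, row_key, after_columns):
    # Record the "after" image together with the timestamp we intend to use.
    ts = int(time.time() * 1e6)   # Cassandra timestamps are conventionally microseconds
    xact_log_cf.insert(txn_id, {'row_key': row_key,
                                'after': json.dumps(after_columns),
                                'ts': str(ts)})
    data_cf.insert(row_key, after_columns, timestamp=ts)
    xact_log_cf.remove(txn_id)    # "commit": nothing left to replay

def replay(txn_id):
    # Re-drive an abandoned entry with its *original* timestamp, so any write
    # made since the failed attempt carries a newer timestamp and wins.
    entry = xact_log_cf.get(txn_id)
    data_cf.insert(entry['row_key'], json.loads(entry['after']),
                   timestamp=int(entry['ts']))
    xact_log_cf.remove(txn_id)

The replay is then idempotent in the last-write-wins sense: re-running it
any number of times leaves the row as if the original attempt had
succeeded, without touching anything written afterwards.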



Re: best practices for simulating transactions in Cassandra

Posted by Jérémy SEVELLEC <js...@gmail.com>.
I don't use exactly the same approach described in the presentation. I use
the idea of writing the whole data set into one column value in a single
write, to get atomicity.

My application makes a read before each write to verify the state of the
last transaction log. If there is data in my "transactionlog" column, it's
because there was a problem and it's impossible to go further. It's perhaps
not very efficient, but it's transactional... it's a tradeoff...
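
A minimal sketch of that read-before-write guard, with pycassa as an
example client (the column family, the "transactionlog" column name and
the helper are illustrative assumptions):

import json
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
data_cf = pycassa.ColumnFamily(pool, 'Data')

def guarded_write(row_key, job_payload, do_the_job):
    # Refuse to start if a previous transaction log entry is still present:
    # an earlier job never finished and has to be replayed first.
    try:
        data_cf.get(row_key, columns=['transactionlog'])
        raise RuntimeError('unfinished transaction on row %r' % row_key)
    except pycassa.NotFoundException:
        pass   # nothing outstanding, safe to proceed

    # Write the whole payload into a single column value (atomic single-row
    # write), do the job, then clear the log entry.
    data_cf.insert(row_key, {'transactionlog': json.dumps(job_payload)})
    do_the_job(job_payload)
    data_cf.remove(row_key, columns=['transactionlog'])

It costs a read per write, as noted above, but the check and the payload
both live on the same row.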

I could not find other solutions, but if you have one, I'm interested!
Perhaps other Cassandra users have found other solutions?

Cages is pretty cool and really simplifies the use of ZooKeeper because
it's focused on locking, with a comprehensive API. It's not very well
documented, though, and you have to dig into the code to understand how to
use it (to know how to connect to a ZooKeeper "cluster", for example...).

Jérémy


-- 
Jérémy

Re: best practices for simulating transactions in Cassandra

Posted by John Laban <jo...@pagerduty.com>.
Ah, neat.  It is similar to what was proposed in (4) above with adding
transactions to Cages, but instead of snapshotting the data to be rolled
back (the "before" data), you snapshot the data to be replayed (the "after"
data).  And then later, if you find that the transaction didn't complete,
you just keep replaying the transaction until it takes.

The part I don't understand with this approach though:  how do you ensure
that someone else didn't change the data between your initial failed
transaction and the later replaying of the transaction?  You could get lost
writes in that situation.

Dominic (in the Cages blog post) explained a workaround for that in his
rollback proposal: all subsequent readers or writers of that data would
have to check for abandoned transactions and roll them back themselves
before they could read the data.  I don't think this is possible with the
XACT_LOG "replay" approach in these slides though, based on how the data is
indexed (cassandra node token + timeUUID).
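
For completeness, that reader-side check could look roughly like this,
assuming the snapshot row is keyed by the same key as the data row
(pycassa, the "Snapshots" column family and the 'before' column are
assumptions, and the check still has to run under the lock):

import json
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
data_cf = pycassa.ColumnFamily(pool, 'Data')
snapshot_cf = pycassa.ColumnFamily(pool, 'Snapshots')   # hypothetical snapshot CF

def read_with_rollback(row_key):
    # If a snapshot is still lying around, the writer that created it died
    # before committing: restore the "before" image, then clear the snapshot.
    try:
        snap = snapshot_cf.get(row_key, columns=['before'])
        before = json.loads(snap['before'])
        if before:
            data_cf.insert(row_key, before)
        snapshot_cf.remove(row_key)
    except pycassa.NotFoundException:
        pass   # no abandoned transaction on this row

    return data_cf.get(row_key)

This only works because the snapshot can be found from the data row's own
key; the point about the XACT_LOG layout (cassandra node token + timeUUID)
is exactly that a reader has no such handle.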


PS:  How are you liking Cages?





Re: best practices for simulating transactions in Cassandra

Posted by Jérémy SEVELLEC <js...@gmail.com>.
Hi John,

I had exactly the same thoughts.

I'm using ZooKeeper and Cages to lock and isolate.

But how to roll back?
It's impossible, so try replay instead!

The idea is explained in this presentation
http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting from
slide 24):

- insert your whole data set into one column
- do the job
- remove (or expire) your column.

If there is a problem while "doing the job", you keep the possibility of
replaying again and again (synchronously or in a batch).
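
A minimal sketch of that pattern, using pycassa as an example client (the
"Jobs" column family, the key scheme and the helpers are assumptions of
mine; a TTL on the insert would give the "expire" variant):

import json
import uuid
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
jobs_cf = pycassa.ColumnFamily(pool, 'Jobs')   # hypothetical CF of pending jobs

def run_job(payload, do_the_job):
    job_id = str(uuid.uuid1())
    # 1) insert your whole data set into one column (atomic single-row write);
    #    pass ttl=... here instead of removing below to get the "expire" variant
    jobs_cf.insert(job_id, {'payload': json.dumps(payload)})
    # 2) do the job
    do_the_job(payload)
    # 3) remove the column; if we never get here, the entry stays visible
    #    and can be replayed later
    jobs_cf.remove(job_id)

def replay_pending(do_the_job, batch_size=100):
    # Synchronous or batch replay of anything a crash left behind.
    # do_the_job must be idempotent, since it may run more than once.
    for job_id, cols in jobs_cf.get_range(row_count=batch_size):
        do_the_job(json.loads(cols['payload']))
        jobs_cf.remove(job_id)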

Regards

Jérémy




-- 
Jérémy