You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Max C <mc...@core43.com> on 2016/03/17 07:22:01 UTC

Strategies for avoiding corrupted duplicate data?

Hello,

What are your best practices for avoiding collisions when updating duplicate (derived) data?  For example if I have tables like this:

users — (username, email, phone) — PK (username)
users_by_email — (username, email, phone) — PK (email)
users_by_phone — (username, email, phone) — PK ((phone), username)

… and I want to change the user’s email address ...

What is your strategy for ensuring that two writers don’t try to update the same records at the same time?

Some sort of lock?

1) Do you have a global “locks” table in Cassandra, and use a LWT to lock it?

insert into locks (what, locked_by) values ('users:maxc', ’server1_pid1234') if not exists;
# Verify insert succeeded
# Now that I have the lock, grab the latest 
select * from users where username=‘maxc’;   
begin batch
# … apply changes ...
apply batch;
# release lock
delete from locks where what=‘users:maxc’;  

2) Do you add a “locked_by” column to the master (“users”) table, and then use a LWT to lock it?

update users set locked_by=‘server1_pid1234’’ where username=“maxc” if locked_by = null;
# Verify update succeeded
# Now that I have the lock, grab the latest 
select * from users where username=‘maxc’;   
begin batch
# … apply changes ...
apply batch;
# release lock
update users set locked_by=null where username=‘maxc’; 

3) Do you use something outside of Cassandra to manage the locks?  Zoo keeper?  

## Acquire external lock ##
# Now that I have the lock, grab the latest 
select * from users where username=‘maxc’;   
begin batch
# … apply changes ...
apply batch;
## release external lock ##

Or is there some other way to do this that I’m totally missing??  Materialized views in 3.x, I suppose.  Other ideas?

Thanks!

- Max

Re: Strategies for avoiding corrupted duplicate data?

Posted by Carlos Alonso <in...@mrcalonso.com>.
Hi Max.

The best way to achieve this is by actually having idempotent operations.
So, in your particular case, is there any chance for a user to update his
email twice at the same time with different values? Given that Cassandra
upserts information, even two inserts in the same table with the same
values will lead to just one.

If you cannot guarantee that, then a good way to achieve this I think is by
read-before-write + optimistic locking. i.e. Adding a 'version' field on
the record and running a LWT.

For example:

UPDATE users_by_email set email = 'new_email', version = 2 where ... IF
version = 1 (version number previously read, of course)

Hope this helps.

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 17 March 2016 at 07:22, Max C <mc...@core43.com> wrote:

> Hello,
>
> What are your best practices for avoiding collisions when updating
> duplicate (derived) data?  For example if I have tables like this:
>
> users — (username, email, phone) — PK (username)
> users_by_email — (username, email, phone) — PK (email)
> users_by_phone — (username, email, phone) — PK ((phone), username)
>
> … and I want to change the user’s email address ...
>
> What is your strategy for ensuring that two writers don’t try to update
> the same records at the same time?
>
> Some sort of lock?
>
> 1) Do you have a global “locks” table in Cassandra, and use a LWT to lock
> it?
>
> insert into locks (what, locked_by) values ('users:maxc',
> ’server1_pid1234') if not exists;
> # Verify insert succeeded
> # Now that I have the lock, grab the latest
> select * from users where username=‘maxc’;
> begin batch
> # … apply changes ...
> apply batch;
> # release lock
> delete from locks where what=‘users:maxc’;
>
> 2) Do you add a “locked_by” column to the master (“users”) table, and then
> use a LWT to lock it?
>
> update users set locked_by=‘server1_pid1234’’ where username=“maxc” if
> locked_by = null;
> # Verify update succeeded
> # Now that I have the lock, grab the latest
> select * from users where username=‘maxc’;
> begin batch
> # … apply changes ...
> apply batch;
> # release lock
> update users set locked_by=null where username=‘maxc’;
>
> 3) Do you use something outside of Cassandra to manage the locks?  Zoo
> keeper?
>
> ## Acquire external lock ##
> # Now that I have the lock, grab the latest
> select * from users where username=‘maxc’;
> begin batch
> # … apply changes ...
> apply batch;
> ## release external lock ##
>
> Or is there some other way to do this that I’m totally missing??
> Materialized views in 3.x, I suppose.  Other ideas?
>
> Thanks!
>
> - Max
>

Re: Strategies for avoiding corrupted duplicate data?

Posted by Clint Martin <cl...@coolfiretechnologies.com>.
Light weight transactions are going to be somewhat key to this. As are
batches.

The interesting thing about these views is that changing an email address
is not the same operation on all of them.

For The users by email view you have to delete a given existing row and
insert a new one.

For the others an update using a lwt to ensure the existing email is what
you think it is, will be sufficient.

The batch is necessary because you inherently have a race condition when
updating all of these tables. Two really. One as you update each table. And
one where you are concerned about where two updates for different values
occur at the same time. These two cases are related and interact poorly

If you are using Cassandra 3.x materialized views would be a good solution
for this.

Clint





On Mar 17, 2016 2:22 AM, "Max C" <mc...@core43.com> wrote:

> Hello,
>
> What are your best practices for avoiding collisions when updating
> duplicate (derived) data?  For example if I have tables like this:
>
> users — (username, email, phone) — PK (username)
> users_by_email — (username, email, phone) — PK (email)
> users_by_phone — (username, email, phone) — PK ((phone), username)
>
> … and I want to change the user’s email address ...
>
> What is your strategy for ensuring that two writers don’t try to update
> the same records at the same time?
>
> Some sort of lock?
>
> 1) Do you have a global “locks” table in Cassandra, and use a LWT to lock
> it?
>
> insert into locks (what, locked_by) values ('users:maxc',
> ’server1_pid1234') if not exists;
> # Verify insert succeeded
> # Now that I have the lock, grab the latest
> select * from users where username=‘maxc’;
> begin batch
> # … apply changes ...
> apply batch;
> # release lock
> delete from locks where what=‘users:maxc’;
>
> 2) Do you add a “locked_by” column to the master (“users”) table, and then
> use a LWT to lock it?
>
> update users set locked_by=‘server1_pid1234’’ where username=“maxc” if
> locked_by = null;
> # Verify update succeeded
> # Now that I have the lock, grab the latest
> select * from users where username=‘maxc’;
> begin batch
> # … apply changes ...
> apply batch;
> # release lock
> update users set locked_by=null where username=‘maxc’;
>
> 3) Do you use something outside of Cassandra to manage the locks?  Zoo
> keeper?
>
> ## Acquire external lock ##
> # Now that I have the lock, grab the latest
> select * from users where username=‘maxc’;
> begin batch
> # … apply changes ...
> apply batch;
> ## release external lock ##
>
> Or is there some other way to do this that I’m totally missing??
> Materialized views in 3.x, I suppose.  Other ideas?
>
> Thanks!
>
> - Max
>