You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Ajaya Agrawal <aj...@gmail.com> on 2015/02/06 15:02:36 UTC

Mutable primary key in a table

Hi guys,

I want to take a row with primary key K1, rewrite it with primary key K2,
and delete the original data with key K1, atomically.

It seems like the only solution which won't have race conditions is to use
batch statement to delete the old row and insert the new one. But the
documentation of batch operation makes me nervous. The specific parts in
docs are the ones which say that all nodes in your cluster become stressed
if you use logged batches(default one).

Is it a solved problem already?
Cheers,
Ajaya

Re: Mutable primary key in a table

Posted by Colin Clark <co...@clark.ws>.
No need for CAS in my suggestion - I would try to avoid the use of CAS if at all possible.  

It’s better in a distributed environment to reduce dimensionality and isolate write/read paths (event sourcing and CQRS patterns).

Also, just in general, changing the primary key on an update is usually considered a bad idea and is simply not even permitted by most RDBMS.
—
Colin Clark
colin@clark.ws
+1 320-221-9531
skype colin.p.clark

> On Feb 8, 2015, at 4:16 PM, Eric Stevens <mi...@gmail.com> wrote:
> 
> It sounds like changing user names is the kind of thing which doesn't happen often, in which case you probably don't have to worry too much about the additional overhead of using logged batches (not like you're going to be doing hundreds to thousands of these per second).  You probably also want to look into conditional updates (search for Compare And Set - CAS) to help avoid collisions when creating or renaming users.
> 
> Colin's suggestion of using a surrogate key for the primary key on the user table is also a good idea, but you'll still want to use CAS to help maintain the integrity of your data.  Note that CAS has a similar overhead to logged batches in that it also involves a Paxos round.  So keep the number of statements in either CAS or logged batches as minimal as possible.
> 
> On Sun, Feb 8, 2015 at 7:17 AM, Colin <colin@clark.ws <ma...@clark.ws>> wrote:
> Another way to do this is to use a time based uuid for the primary key (partition key) and to store the user name with that uuid.
> 
> In addition, you'll need 2 additonal tables, one that is used to get the uuid by user name and another to track user name changes over time which would be organized by uuid, and user name (cluster on the name).
> 
> This pattern is referred to as an inverted index and provides a lot of power and flexibility once mastered.  I use it all the time with cassandra - in fact, to be successful with cassandra, it might actually be a requirement!
> 
> --
> Colin Clark 
> +1 612 859 6129 <tel:%2B1%20612%20859%206129>
> Skype colin.p.clark
> 
> On Feb 8, 2015, at 8:08 AM, Jack Krupansky <jack.krupansky@gmail.com <ma...@gmail.com>> wrote:
> 
>> What is your full primary key? Specifically, what is the partition key, as opposed to clustering columns?
>> 
>> The point is that the partition key for a row is hashed to determine the token for the partition, which in turn determines which node of the cluster owns that partition. Changing the partition key means that potentially the partition would need to be "moved" to another node, which is clearly not something that Cassandra would do since the core design of Cassandra is that all operations should be blazingly fast and to refrain from offering slow features.
>> 
>> I would recommend that your application:
>> 
>> 1. Read the existing user data
>> 2. Create a new user, using the existing user data.
>> 3. Update the old user row to indicate that it is no longer a valid user. Actually, you will have to decide on an application policy for old user names. For example, can they be reused, or are they locked, or... whatever.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Sun, Feb 8, 2015 at 1:48 AM, Ajaya Agrawal <ajku.agr@gmail.com <ma...@gmail.com>> wrote:
>> 
>> On Sun, Feb 8, 2015 at 5:03 AM, Eric Stevens <mightye@gmail.com <ma...@gmail.com>> wrote:
>> I'm struggling to think of a model where it makes sense to update a primary key as a typical operation.  It suggests, as Adil said, that you may be reasoning wrong about your data model.  Maybe you can explain your problem in more detail - what kind of thing has you updating your PK on a regular basis?
>> 
>> I have a 'user' table which has a column called 'user_name' and other columns like name, city etc. The application requires that user_name be unique and user should be searchable by 'user_name'. The only way to do this in C* would be to make user_name column primary key. Things get trickier when there is a requirement which says that user_name can be changed by the users of the application. This a distributed application which mean that it runs on multiple nodes. If I have to change user_name atomically then either I need to implement distributed locking or use something C* provides.   
>> 
>> 
> 


Re: Mutable primary key in a table

Posted by Eric Stevens <mi...@gmail.com>.
It sounds like changing user names is the kind of thing which doesn't
happen often, in which case you probably don't have to worry too much about
the additional overhead of using logged batches (not like you're going to
be doing hundreds to thousands of these per second).  You probably also
want to look into conditional updates (search for Compare And Set - CAS) to
help avoid collisions when creating or renaming users.

Colin's suggestion of using a surrogate key for the primary key on the user
table is also a good idea, but you'll still want to use CAS to help
maintain the integrity of your data.  Note that CAS has a similar overhead
to logged batches in that it also involves a Paxos round.  So keep the
number of statements in either CAS or logged batches as minimal as possible.

On Sun, Feb 8, 2015 at 7:17 AM, Colin <co...@clark.ws> wrote:

> Another way to do this is to use a time based uuid for the primary key
> (partition key) and to store the user name with that uuid.
>
> In addition, you'll need 2 additonal tables, one that is used to get the
> uuid by user name and another to track user name changes over time which
> would be organized by uuid, and user name (cluster on the name).
>
> This pattern is referred to as an inverted index and provides a lot of
> power and flexibility once mastered.  I use it all the time with cassandra
> - in fact, to be successful with cassandra, it might actually be a
> requirement!
>
> --
> *Colin Clark*
> +1 612 859 6129
> Skype colin.p.clark
>
> On Feb 8, 2015, at 8:08 AM, Jack Krupansky <ja...@gmail.com>
> wrote:
>
> What is your full primary key? Specifically, what is the partition key, as
> opposed to clustering columns?
>
> The point is that the partition key for a row is hashed to determine the
> token for the partition, which in turn determines which node of the cluster
> owns that partition. Changing the partition key means that potentially the
> partition would need to be "moved" to another node, which is clearly not
> something that Cassandra would do since the core design of Cassandra is
> that all operations should be blazingly fast and to refrain from offering
> slow features.
>
> I would recommend that your application:
>
> 1. Read the existing user data
> 2. Create a new user, using the existing user data.
> 3. Update the old user row to indicate that it is no longer a valid user.
> Actually, you will have to decide on an application policy for old user
> names. For example, can they be reused, or are they locked, or... whatever.
>
>
> -- Jack Krupansky
>
> On Sun, Feb 8, 2015 at 1:48 AM, Ajaya Agrawal <aj...@gmail.com> wrote:
>
>>
>> On Sun, Feb 8, 2015 at 5:03 AM, Eric Stevens <mi...@gmail.com> wrote:
>>
>>> I'm struggling to think of a model where it makes sense to update a
>>> primary key as a typical operation.  It suggests, as Adil said, that you
>>> may be reasoning wrong about your data model.  Maybe you can explain your
>>> problem in more detail - what kind of thing has you updating your PK on a
>>> regular basis?
>>>
>>> I have a 'user' table which has a column called 'user_name' and other
>> columns like name, city etc. The application requires that user_name be
>> unique and user should be searchable by 'user_name'. The only way to do
>> this in C* would be to make user_name column primary key. Things get
>> trickier when there is a requirement which says that user_name can be
>> changed by the users of the application. This a distributed application
>> which mean that it runs on multiple nodes. If I have to change user_name
>> atomically then either I need to implement distributed locking or use
>> something C* provides.
>>
>>
>

Re: Mutable primary key in a table

Posted by Colin <co...@clark.ws>.
Another way to do this is to use a time based uuid for the primary key (partition key) and to store the user name with that uuid.

In addition, you'll need 2 additonal tables, one that is used to get the uuid by user name and another to track user name changes over time which would be organized by uuid, and user name (cluster on the name).

This pattern is referred to as an inverted index and provides a lot of power and flexibility once mastered.  I use it all the time with cassandra - in fact, to be successful with cassandra, it might actually be a requirement!

--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

> On Feb 8, 2015, at 8:08 AM, Jack Krupansky <ja...@gmail.com> wrote:
> 
> What is your full primary key? Specifically, what is the partition key, as opposed to clustering columns?
> 
> The point is that the partition key for a row is hashed to determine the token for the partition, which in turn determines which node of the cluster owns that partition. Changing the partition key means that potentially the partition would need to be "moved" to another node, which is clearly not something that Cassandra would do since the core design of Cassandra is that all operations should be blazingly fast and to refrain from offering slow features.
> 
> I would recommend that your application:
> 
> 1. Read the existing user data
> 2. Create a new user, using the existing user data.
> 3. Update the old user row to indicate that it is no longer a valid user. Actually, you will have to decide on an application policy for old user names. For example, can they be reused, or are they locked, or... whatever.
> 
> 
> -- Jack Krupansky
> 
>> On Sun, Feb 8, 2015 at 1:48 AM, Ajaya Agrawal <aj...@gmail.com> wrote:
>> 
>>> On Sun, Feb 8, 2015 at 5:03 AM, Eric Stevens <mi...@gmail.com> wrote:
>>> I'm struggling to think of a model where it makes sense to update a primary key as a typical operation.  It suggests, as Adil said, that you may be reasoning wrong about your data model.  Maybe you can explain your problem in more detail - what kind of thing has you updating your PK on a regular basis?
>> 
>> I have a 'user' table which has a column called 'user_name' and other columns like name, city etc. The application requires that user_name be unique and user should be searchable by 'user_name'. The only way to do this in C* would be to make user_name column primary key. Things get trickier when there is a requirement which says that user_name can be changed by the users of the application. This a distributed application which mean that it runs on multiple nodes. If I have to change user_name atomically then either I need to implement distributed locking or use something C* provides.   
> 

Re: Mutable primary key in a table

Posted by Jack Krupansky <ja...@gmail.com>.
What is your full primary key? Specifically, what is the partition key, as
opposed to clustering columns?

The point is that the partition key for a row is hashed to determine the
token for the partition, which in turn determines which node of the cluster
owns that partition. Changing the partition key means that potentially the
partition would need to be "moved" to another node, which is clearly not
something that Cassandra would do since the core design of Cassandra is
that all operations should be blazingly fast and to refrain from offering
slow features.

I would recommend that your application:

1. Read the existing user data
2. Create a new user, using the existing user data.
3. Update the old user row to indicate that it is no longer a valid user.
Actually, you will have to decide on an application policy for old user
names. For example, can they be reused, or are they locked, or... whatever.


-- Jack Krupansky

On Sun, Feb 8, 2015 at 1:48 AM, Ajaya Agrawal <aj...@gmail.com> wrote:

>
> On Sun, Feb 8, 2015 at 5:03 AM, Eric Stevens <mi...@gmail.com> wrote:
>
>> I'm struggling to think of a model where it makes sense to update a
>> primary key as a typical operation.  It suggests, as Adil said, that you
>> may be reasoning wrong about your data model.  Maybe you can explain your
>> problem in more detail - what kind of thing has you updating your PK on a
>> regular basis?
>>
>> I have a 'user' table which has a column called 'user_name' and other
> columns like name, city etc. The application requires that user_name be
> unique and user should be searchable by 'user_name'. The only way to do
> this in C* would be to make user_name column primary key. Things get
> trickier when there is a requirement which says that user_name can be
> changed by the users of the application. This a distributed application
> which mean that it runs on multiple nodes. If I have to change user_name
> atomically then either I need to implement distributed locking or use
> something C* provides.
>
>

Re: Mutable primary key in a table

Posted by Ajaya Agrawal <aj...@gmail.com>.
On Sun, Feb 8, 2015 at 5:03 AM, Eric Stevens <mi...@gmail.com> wrote:

> I'm struggling to think of a model where it makes sense to update a
> primary key as a typical operation.  It suggests, as Adil said, that you
> may be reasoning wrong about your data model.  Maybe you can explain your
> problem in more detail - what kind of thing has you updating your PK on a
> regular basis?
>
> I have a 'user' table which has a column called 'user_name' and other
columns like name, city etc. The application requires that user_name be
unique and user should be searchable by 'user_name'. The only way to do
this in C* would be to make user_name column primary key. Things get
trickier when there is a requirement which says that user_name can be
changed by the users of the application. This a distributed application
which mean that it runs on multiple nodes. If I have to change user_name
atomically then either I need to implement distributed locking or use
something C* provides.

Re: Mutable primary key in a table

Posted by Eric Stevens <mi...@gmail.com>.
I'm struggling to think of a model where it makes sense to update a primary
key as a typical operation.  It suggests, as Adil said, that you may be
reasoning wrong about your data model.  Maybe you can explain your problem
in more detail - what kind of thing has you updating your PK on a regular
basis?

Logged batches do get you pretty close to being able to accomplish what you
need, but as you pointed out, they are a higher cost to your cluster (they
involve a Paxos round, and generally this is about 3x the cost as a typical
update).

On Fri, Feb 6, 2015 at 11:19 PM, Ajaya Agrawal <aj...@gmail.com> wrote:

> On Fri, Feb 6, 2015 at 7:43 PM, Adil <ad...@gmail.com> wrote:
>
>> Hi,
>> it seems you are doing some thing wrong in your model, why can you go
>> with updating columns of key1 instead of deleting/inserting row key?
>>
>
> You can't update a primary key column with update statements.
>
>

Re: Mutable primary key in a table

Posted by Ajaya Agrawal <aj...@gmail.com>.
On Fri, Feb 6, 2015 at 7:43 PM, Adil <ad...@gmail.com> wrote:

> Hi,
> it seems you are doing some thing wrong in your model, why can you go with
> updating columns of key1 instead of deleting/inserting row key?
>

You can't update a primary key column with update statements.

Re: Mutable primary key in a table

Posted by Adil <ad...@gmail.com>.
Hi,
it seems you are doing some thing wrong in your model, why can you go with
updating columns of key1 instead of deleting/inserting row key?

2015-02-06 15:02 GMT+01:00 Ajaya Agrawal <aj...@gmail.com>:

> Hi guys,
>
> I want to take a row with primary key K1, rewrite it with primary key K2,
> and delete the original data with key K1, atomically.
>
> It seems like the only solution which won't have race conditions is to use
> batch statement to delete the old row and insert the new one. But the
> documentation of batch operation makes me nervous. The specific parts in
> docs are the ones which say that all nodes in your cluster become stressed
> if you use logged batches(default one).
>
> Is it a solved problem already?
> Cheers,
> Ajaya
>