Posted to user@cassandra.apache.org by Renat Gilfanov <gr...@mail.ru> on 2013/09/07 23:31:21 UTC

Recommended way of data migration

Hello,

Let's say we have a simple CQL3 table:

CREATE TABLE example (
    id UUID PRIMARY KEY,
    timestamp TIMESTAMP,
    data ASCII
);

I need to change (for example, encrypt) the values in the "data" column for all rows.

What is the recommended approach to perform such a migration programmatically?

The general approach I have in mind is:

1. Create another column family (CF).
2. Extract a batch of records.
3. For each extracted record, perform the mutation, insert it into the new CF, and delete it from the old one.
4. Repeat until the source CF is empty.

Is this the correct approach, and if so, how should I implement some kind of paging for step 2?

Re: Recommended way of data migration

Posted by Paulo Motta <pa...@gmail.com>.

That's a good approach. You could also migrate in place if you're confident
your migration algorithm is correct, but using another CF is safer.

If you have a huge volume of data to migrate (millions of rows or
more), I'd suggest using Hadoop to perform the migration
(http://wiki.apache.org/cassandra/HadoopSupport).

If it's only a few rows, then you could do it programmatically via
get_range_slices using the language binding of your choice. Below are some
links on how to do this with Hector or Pycassa:

* Hector:
http://stackoverflow.com/questions/8418448/cassandra-hector-how-to-retrieve-all-rows-of-a-column-family
* Pycassa:
http://pycassa.github.io/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_range
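To make the paging idea concrete, here is a minimal sketch in Python. It is
not tied to any driver: fetch_page is a hypothetical callable that returns up
to "limit" rows whose key comes strictly after "after_key" in the store's
iteration order, and make_in_memory_fetch is only a stand-in so the loop can
be run without a cluster. Against a real CQL3 table, one way to implement
fetch_page (an assumption, not something tested here) would be a query such as
SELECT id, data FROM example WHERE token(id) > token(?) LIMIT ?.

```python
import bisect
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Tuple[str, dict]
# Hypothetical contract: return at most `limit` rows located strictly
# after `after_key` (or from the beginning when `after_key` is None).
FetchPage = Callable[[Optional[str], int], List[Row]]


def iter_all_rows(fetch_page: FetchPage, page_size: int = 100) -> Iterator[Row]:
    """Walk the whole column family one page at a time."""
    last_key: Optional[str] = None
    while True:
        page = fetch_page(last_key, page_size)
        if not page:
            return  # an empty page means every row has been visited
        for row in page:
            yield row
        # Remember where this page ended so the next one starts after it.
        last_key = page[-1][0]


def make_in_memory_fetch(rows: Dict[str, dict]) -> FetchPage:
    """Stand-in for a real data store, ordered by key (illustration only)."""
    ordered = sorted(rows.items())
    keys = [key for key, _ in ordered]

    def fetch_page(after_key: Optional[str], limit: int) -> List[Row]:
        start = 0 if after_key is None else bisect.bisect_right(keys, after_key)
        return ordered[start:start + limit]

    return fetch_page
```

The loop only ever holds one page in memory, and because it resumes from the
last key seen, an interrupted run can be restarted from a saved key instead of
from the beginning.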

I agree with Edward that you should only delete the rows once you have made
sure they were correctly migrated.

2013/9/7 Edward Capriolo <ed...@gmail.com>

> I would do something like you are suggesting. I would not do the delete
> until all the rows are moved. Since writes in cassandra are idempotent you
> can even run the migration process multiple times without harm.

-- 
Paulo Ricardo

European Master in Distributed Computing
Royal Institute of Technology - KTH
Instituto Superior Técnico - IST
http://paulormg.com

Re: Recommended way of data migration

Posted by Edward Capriolo <ed...@gmail.com>.

I would do something like what you are suggesting, but I would not delete
anything until all the rows have been moved. Since writes in Cassandra are
idempotent, you can even run the migration process multiple times without harm.
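A rough sketch of that ordering (copy everything, verify, and only then
delete) is below. Plain Python dicts stand in for the source and target CFs,
and placeholder_transform is a hypothetical stand-in for the real mutation; it
is base64 encoding, not encryption. The verify step assumes the transform is
deterministic, which may not hold for a real cipher with a random IV.

```python
import base64
from typing import Callable, Dict

Transform = Callable[[str], str]


def placeholder_transform(value: str) -> str:
    """Hypothetical stand-in for the real mutation. NOT encryption."""
    return base64.b64encode(value.encode("ascii")).decode("ascii")


def copy_rows(source: Dict[str, str], target: Dict[str, str],
              transform: Transform) -> int:
    """Write every transformed row to the target; the source is untouched.

    Writing the same key again just overwrites it with the same value,
    so this step can be re-run safely after a failure.
    """
    for key, value in source.items():
        target[key] = transform(value)
    return len(source)


def verify(source: Dict[str, str], target: Dict[str, str],
           transform: Transform) -> bool:
    """Check that every source row has its expected counterpart."""
    return all(target.get(key) == transform(value)
               for key, value in source.items())


def migrate(source: Dict[str, str], target: Dict[str, str],
            transform: Transform) -> None:
    """Copy, verify, and only then remove the old rows."""
    copy_rows(source, target, transform)
    if not verify(source, target, transform):
        raise RuntimeError("verification failed; source rows were kept")
    source.clear()  # the delete happens last, after all rows are confirmed
```

Running copy_rows twice leaves the target in the same state as running it
once, which is the property that makes a crashed migration safe to restart.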

