Posted to user@cassandra.apache.org by Gaurav Bhatnagar <gb...@gmail.com> on 2014/10/09 09:19:25 UTC

efficiently generate complete database dump in text format

Hi,
   We have a Cassandra column family containing 320 million rows, and
each row contains about 15 columns. We want to take a monthly dump of
this single column family in text format.

We are planning to take the following approach to implement this:
1. Take a snapshot of the Cassandra database using the nodetool utility,
     specifying the -cf flag with the column family name so that the
     snapshot contains only the data for that single column family.
2. Back up this snapshot and move the backup to a separate physical
     machine.
3. Use the "SSTable to JSON conversion" utility to convert all the data
     files into JSON format.
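
For illustration, steps 1 and 3 could be scripted roughly as below. This is
only a sketch: it assumes the Cassandra 2.0/2.1-era command lines
(`nodetool snapshot -cf <column family> -t <tag> <keyspace>` and
`sstable2json <file>-Data.db`, which writes JSON to stdout). The function
names, the tag, and the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def snapshot_command(keyspace, column_family, tag):
    """Build the nodetool call for step 1: snapshot a single column family."""
    return ["nodetool", "snapshot", "-cf", column_family, "-t", tag, keyspace]


def sstable2json_command(data_file):
    """Build the sstable2json call for step 3; the tool prints JSON to stdout."""
    return ["sstable2json", str(data_file)]


def convert_snapshot(snapshot_dir, out_dir, run=subprocess.run):
    """Convert every *-Data.db file in snapshot_dir into one .json file each.

    `run` is injectable so the function can be exercised without Cassandra.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for data_file in sorted(Path(snapshot_dir).glob("*-Data.db")):
        target = out_dir / (data_file.stem + ".json")
        with open(target, "w") as out:
            # Redirect the tool's stdout into the target file.
            run(sstable2json_command(data_file), stdout=out, check=True)
        written.append(target)
    return written
```

On the node, `subprocess.run(snapshot_command("my_keyspace", "my_cf",
"monthly"), check=True)` would take the snapshot; after step 2,
`convert_snapshot(...)` would be run on the backup machine against the
copied snapshot directory.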

We have the following questions about this approach:
a) The generated JSON records contain a "d" (IS_MARKED_FOR_DELETE) flag.
     Can I safely ignore all such records?
b) If I ignore all records marked with the "d" flag, can the JSON files
     generated in step 3 still contain duplicate records, i.e. multiple
     entries for the same key?

Is there a better approach to generating data dumps in text format?

Regards,
Gaurav

Re: efficiently generate complete database dump in text format

Posted by Paulo Ricardo Motta Gomes <pa...@chaordicsystems.com>.

The best way to generate dumps from Cassandra is via Hadoop integration (or
Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200

Re: efficiently generate complete database dump in text format

Posted by Daniel Chia <da...@coursera.org>.

You might also want to consider tools like
https://github.com/Netflix/aegisthus for the last step, which can help you
deal with tombstones and de-duplicate data.
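
To show why tombstones and duplicates need handling at all: the same key can
appear in several SSTables (and in snapshots from several replicas), and a
"d" record in a newer SSTable may shadow a live value in an older one, so
simply skipping "d" records can resurrect deleted data. Below is a minimal
Python sketch of the merge, not how aegisthus itself works. It assumes the
2.0-era JSON layout, where each file is an array of
{"key": ..., "columns": [[name, value, timestamp, flags...]]} rows with an
optional "metadata"/"deletionInfo" row tombstone; range tombstones and TTL
expiry are ignored, and the function names are hypothetical.

```python
import json


def load_dump(path):
    """Load one JSON dump file as a list of row dicts."""
    with open(path) as f:
        return json.load(f)


def merge_rows(row_lists):
    """Merge rows from several dumps into {key: {column: value}}.

    Per (key, column) the cell with the highest timestamp wins; on a tie a
    tombstone beats a live cell. Cells whose winner is a tombstone, or that
    are covered by a row-level deletion, are dropped.
    """
    cells = {}           # (key, column) -> (timestamp, is_tombstone, value)
    row_deleted_at = {}  # key -> newest row-level deletion timestamp seen
    for rows in row_lists:
        for row in rows:
            key = row["key"]
            info = row.get("metadata", {}).get("deletionInfo")
            if info:
                ts = info["markedForDeleteAt"]
                row_deleted_at[key] = max(ts, row_deleted_at.get(key, ts))
            for name, value, timestamp, *flags in row.get("columns", []):
                candidate = (timestamp, "d" in flags, value)
                current = cells.get((key, name))
                # Compare (timestamp, is_tombstone); True sorts above False.
                if current is None or candidate[:2] > current[:2]:
                    cells[(key, name)] = candidate
    merged = {}
    for (key, name), (timestamp, is_tombstone, value) in cells.items():
        if is_tombstone:
            continue
        if key in row_deleted_at and timestamp <= row_deleted_at[key]:
            continue
        merged.setdefault(key, {})[name] = value
    return merged
```

Usage would be along the lines of `merge_rows(load_dump(p) for p in paths)`.
This keeps every cell in memory, which will not scale to 320 million rows;
at that size the same reduce-by-key logic belongs in a Hadoop or Spark job.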

Thanks,
Daniel
