Posted to user@cassandra.apache.org by ZeroUno <ze...@gmail.com> on 2015/06/04 14:39:08 UTC

sstableloader usage doubts

Hi,
while defining backup and restore procedures for a Cassandra cluster I'm
trying to use sstableloader for restoring a snapshot from a backup, but
I'm not sure I fully understand the documentation on how it should be used.

Looking at the examples in the doc at
http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsBulkloader_t.html
it seems that the path_to_keyspace passed as an argument is exactly
the cassandra data directory. So you first move the data to its final
target location and then stream it to the cluster again?

Let's take a step back. My cluster is composed of two data centers. Each
data center has two nodes (nodeA1, nodeA2 for center A, nodeB1, nodeB2
for center B).
I'm using NetworkTopologyStrategy with RF=2.

For periodic backups I create a snapshot on two nodes
simultaneously in a single data center (nodeA1 and nodeA2), and then
move the snapshot files to a safe place.
To simulate a disaster recovery situation, I truncate all tables to
erase data (but not the schema, which would be re-created anyway by my
application), I stop cassandra on all 4 nodes, I move the snapshot
backup files to their original locations (e.g.
/mydatapath/cassandra/data/mykeyspace/mytable1/) on nodeA1 and nodeA2,
then I restart cassandra on all 4 nodes.

Finally, I run:

> sstableloader -d nodeA1,nodeA2,nodeB1,nodeB2 /mydatapath/cassandra/data/mykeyspace/mytable1/
> sstableloader -d nodeA1,nodeA2,nodeB1,nodeB2 /mydatapath/cassandra/data/mykeyspace/mytable2/
> sstableloader -d nodeA1,nodeA2,nodeB1,nodeB2 /mydatapath/cassandra/data/mykeyspace/mytable3/
> [...and so on for all tables]

...on both nodeA1 and nodeA2, where I restored the snapshot.

Is that correct?

I observed some strange behaviour after doing this: when I truncated
tables again, a select count(*) on one of the A nodes still returned a
non-zero number, as if data was still there.
I started thinking that maybe the source sstable directory for
sstableloader should not be the data directory itself, as this causes
some kind of "double data" problem...

Can anyone please tell me if this is the correct way to proceed?
Thank you very much!

--
01

Re: sstableloader usage doubts

Posted by Robert Coli <rc...@eventbrite.com>.

On Tue, Jun 9, 2015 at 1:48 AM, ZeroUno <ze...@gmail.com> wrote:

> As far as I read from the docs, bootstrapping happens when adding a new
> node to the cluster, but in my situation the nodes already exist, I'm only
> adding data back into them.
>

If you don't have the contents of the system keyspace, there is a non-zero
chance of you bootstrapping in some cases.

> Also I have all 4 nodes configured as seeds in cassandra.yaml, so if I'm
> not wrong this should prevent them from auto-bootstrapping.
>

Yes.

=Rob

Re: sstableloader usage doubts

Posted by ZeroUno <ze...@gmail.com>.

On 08/06/15 20:11, Robert Coli wrote:

> On Mon, Jun 8, 2015 at 6:58 AM, ZeroUno <ze...@gmail.com> wrote:
>
>     So... if I stop the two nodes on the first DC, restore their
>     sstables' files, and then restart the nodes, nothing else needs to
>     be done on the first DC?
>
> Be careful to avoid bootstrapping, but yes.

What do you mean?
As far as I read from the docs, bootstrapping happens when adding a new 
node to the cluster, but in my situation the nodes already exist, I'm 
only adding data back into them.

Also I have all 4 nodes configured as seeds in cassandra.yaml, so if I'm 
not wrong this should prevent them from auto-bootstrapping.

Thanks.

Marco

-- 
01

Re: sstableloader usage doubts

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Jun 8, 2015 at 6:58 AM, ZeroUno <ze...@gmail.com> wrote:

> So you mean that "refresh" needs to be used if the cluster is running, but
> if I stopped cassandra while copying the sstables then refresh is useless?
> So the error "No new SSTables were found" during my refresh attempt is due
> to the fact that the sstables in my data dir were not "new" because already
> loaded, and not to the files not being found?
>

Yes. You should be able to see logs of it opening the files it finds in the
data dir.


> So... if I stop the two nodes on the first DC, restore their sstables'
> files, and then restart the nodes, nothing else needs to be done on the
> first DC?
>

Be careful to avoid bootstrapping, but yes.


> And on the second DC instead I just need to do "nodetool rebuild --
> FirstDC" on _both_ nodes?


Yes.

=Rob
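
A minimal dry-run sketch of the rebuild step confirmed above, assuming
the restored data center is named FirstDC as in the question and that
nodetool can reach the second-DC nodes with -h (otherwise the printed
command can be run locally on each node). The function name
print_rebuild_cmds and the host names are placeholders; the function
only prints the commands and does not run them.

```shell
# Dry-run sketch: print one "nodetool rebuild" command per target node.
# Usage: print_rebuild_cmds <source-dc-name> <host> [<host> ...]
print_rebuild_cmds() {
    local source_dc=$1   # data center that already holds the restored data
    shift
    local host
    for host in "$@"; do
        # Each node in the empty data center streams its ranges
        # from replicas in the source data center.
        printf 'nodetool -h %s rebuild -- %s\n' "$host" "$source_dc"
    done
}
```

For the cluster in this thread that would be
print_rebuild_cmds FirstDC nodeB1 nodeB2, i.e. one rebuild per node in
the second data center.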

Re: sstableloader usage doubts

Posted by ZeroUno <ze...@gmail.com>.

On 05/06/15 22:40, Robert Coli wrote:

> On Fri, Jun 5, 2015 at 7:53 AM, Sebastian Estevez <se...@datastax.com> wrote:
>
>     Since you only restored one DC's sstables, you should be able to
>     rebuild them on the second DC.
>
>     Refresh means pick up new SSTables that have been directly added to
>     the data directory.
>
>     Rebuild means stream data from other replicas to re-create SSTables
>     from scratch.
>
> Sebastian's response is correct; use rebuild. Sorry that I missed that
> specific aspect of your question!

Thank you both.

So you mean that "refresh" needs to be used if the cluster is running, 
but if I stopped cassandra while copying the sstables then refresh is 
useless? So the error "No new SSTables were found" during my refresh 
attempt is due to the fact that the sstables in my data dir were not 
"new" because already loaded, and not to the files not being found?

So... if I stop the two nodes on the first DC, restore their sstables' 
files, and then restart the nodes, nothing else needs to be done on the 
first DC?

And on the second DC instead I just need to do "nodetool rebuild -- 
FirstDC" on _both_ nodes?

-- 
01

Re: sstableloader usage doubts

Posted by Robert Coli <rc...@eventbrite.com>.

On Fri, Jun 5, 2015 at 7:53 AM, Sebastian Estevez <se...@datastax.com> wrote:

> Since you only restored one DC's sstables, you should be able to rebuild
> them on the second DC.
>
> Refresh means pick up new SSTables that have been directly added to the
> data directory.
>
> Rebuild means stream data from other replicas to re-create SSTables from
> scratch.
>

Sebastian's response is correct; use rebuild. Sorry that I missed that
specific aspect of your question!

=Rob

Re: sstableloader usage doubts

Posted by Sebastian Estevez <se...@datastax.com>.

Since you only restored one DC's sstables, you should be able to rebuild
them on the second DC.

Refresh means pick up new SSTables that have been directly added to the
data directory.

Rebuild means stream data from other replicas to re-create SSTables from
scratch.

On Jun 5, 2015 6:40 AM, "ZeroUno" <ze...@gmail.com> wrote:

> On 04/06/15 19:50, Robert Coli wrote:
>
> > http://www.pythian.com/blog/bulk-loading-options-for-cassandra/
>
> Thank you Rob, but actually it doesn't matter to me which method is used,
> I can use either nodetool refresh or sstableloader, as long as they work! ;-)
>
> My problem here is that it looks like all my various attempts are failing,
> one way or another (see also my reply to Sebastian).
>
> Marco.
>
> --
> 01
>
>

Re: sstableloader usage doubts

Posted by ZeroUno <ze...@gmail.com>.

On 04/06/15 19:50, Robert Coli wrote:

> http://www.pythian.com/blog/bulk-loading-options-for-cassandra/

Thank you Rob, but actually it doesn't matter to me which method is
used, I can use either nodetool refresh or sstableloader, as long as they
work! ;-)

My problem here is that it looks like all my various attempts are 
failing, one way or another (see also my reply to Sebastian).

Marco.

-- 
01

Re: sstableloader usage doubts

Posted by Robert Coli <rc...@eventbrite.com>.

On Thu, Jun 4, 2015 at 5:39 AM, ZeroUno <ze...@gmail.com> wrote:

> while defining backup and restore procedures for a Cassandra cluster I'm
> trying to use sstableloader for restoring a snapshot from a backup, but I'm
> not sure I fully understand the documentation on how it should be used.
>

http://www.pythian.com/blog/bulk-loading-options-for-cassandra/

=Rob

Re: sstableloader usage doubts

Posted by ZeroUno <ze...@gmail.com>.

On 04/06/15 17:17, Sebastian Estevez wrote:

> If you have all the sstables for each node and no token range changes,
> you can just move the sstables to their spot in the data directory
> (rsync or w/e) and bring up your nodes. If you're already up you can use
> nodetool refresh to load the sstables.

Hi, as previously described, in my situation I have the sstables for 
only TWO of my four nodes, i.e. I have a backup of one datacenter only.

I tried stopping cassandra on all four nodes, copying the sstables to
their original location on the two nodes for which I have a backup, and
restarting cassandra on all four nodes, but the data did not propagate
to the second datacenter: the two nodes where I restored the backup
appeared to be OK, but the two nodes in the other datacenter remained empty.
Am I missing anything?

Also, I tried nodetool refresh with no success.
First of all, on which nodes should I run it?
I tried running it on the nodes where I restored the sstables, but it
exited without any output, and in the log I could see "No new SSTables
were found for <mykeyspace>/<mytablename>". It didn't do anything.
I'm pretty sure I restored the data in the right place, not in the 
snapshot subdirs.

Thanks.

-- 
01

Re: sstableloader usage doubts

Posted by Sebastian Estevez <se...@datastax.com>.

You don't need sstableloader if your topology hasn't changed and you have
all your sstables backed up for each node. sstableloader actually streams
data to all the nodes in a ring (this is what OpsCenter backup restore
does). So you can actually restore to a larger or smaller cluster or a
cluster with different token ranges / vnodes vs. non vnodes etc. It also
requires all your nodes to be up.
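
A minimal dry-run sketch of that streaming approach, assuming the
snapshot files were first copied to a staging tree outside the live data
directory (the staging path and the function name print_loader_cmds are
hypothetical). sstableloader takes the keyspace and table names from the
last two directories of the path it is given, so the staging tree keeps
the <keyspace>/<table>/ shape. The function only prints the commands;
nothing is streamed until they are run by hand.

```shell
# Dry-run sketch: print one sstableloader command per table directory
# found under a staged copy of a keyspace.
# Usage: print_loader_cmds <staging-keyspace-dir> <comma-separated-hosts>
print_loader_cmds() {
    local staging_keyspace_dir=$1   # e.g. /backup/restore/mykeyspace (assumed path)
    local hosts=$2                  # initial contact hosts passed to -d
    local table_dir
    for table_dir in "$staging_keyspace_dir"/*/; do
        # Skip the unexpanded pattern when there are no table directories.
        [ -d "$table_dir" ] || continue
        printf 'sstableloader -d %s %s\n' "$hosts" "$table_dir"
    done
}
```

Called as print_loader_cmds /backup/restore/mykeyspace nodeA1,nodeA2,nodeB1,nodeB2
it prints one line per table, in the same form as the commands in the
original question but reading from the staging copy rather than from the
data directory itself.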

If you have all the sstables for each node and no token range changes, you
can just move the sstables to their spot in the data directory (rsync or
w/e) and bring up your nodes. If you're already up you can use nodetool
refresh to load the sstables.

http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRefresh.html
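
A minimal dry-run sketch of the copy-then-refresh path for a node that
is already running. All paths and the function name print_refresh_steps
are placeholders, and rsync is just one way to do the copy. If the node
was stopped during the copy and then restarted, it opens the files at
startup, so a later refresh has nothing new to pick up.

```shell
# Dry-run sketch: print the copy and refresh steps for one table.
# Usage: print_refresh_steps <snapshot-dir> <table-data-dir> <keyspace> <table>
print_refresh_steps() {
    local snapshot_dir=$1   # where the backed-up snapshot files live (assumed)
    local data_dir=$2       # live table directory on this node
    local keyspace=$3
    local table=$4
    # Copy the sstable files into the table's data directory...
    printf 'rsync -a %s/ %s/\n' "$snapshot_dir" "$data_dir"
    # ...then ask the running node to load the newly placed files.
    printf 'nodetool refresh %s %s\n' "$keyspace" "$table"
}
```

The printed commands are meant to be run on each node whose own
sstables were backed up, once per table.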

All the best,

Sebastián Estévez
