Posted to user@cassandra.apache.org by Yuji Ito <yu...@imagine-orb.com> on 2016/10/17 05:41:39 UTC

failure node rejoin

Hi all,

A failed node can rejoin a cluster even though all data in
/var/lib/cassandra was deleted on that node.
Is this normal?

I can reproduce it as below.

cluster:
- C* 2.2.7
- a cluster has node1, 2, 3
- node1 is a seed
- replication_factor: 3

how to:
1) stop C* process and delete all data in /var/lib/cassandra on node2
($sudo rm -rf /var/lib/cassandra/*)
2) stop C* process on node1 and node3
3) restart C* on node1
4) restart C* on node2
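For repeated runs, the steps above can be scripted. This is only an illustrative sketch: the host names, ssh access and `service cassandra` commands are assumptions about the environment, and with DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch of the reproduction steps. node1/node2/node3, ssh access and the
# "service cassandra" commands are assumptions for illustration only.
# With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1) stop C* and wipe the data directory on node2
run ssh node2 sudo service cassandra stop
run ssh node2 sudo rm -rf /var/lib/cassandra/\*

# 2) stop C* on node1 and node3
run ssh node1 sudo service cassandra stop
run ssh node3 sudo service cassandra stop

# 3) restart C* on node1 (the seed), then 4) restart C* on node2
run ssh node1 sudo service cassandra start
run ssh node2 sudo service cassandra start
```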

nodetool status after 4):
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
DN  [node3 IP]  ?          256     100.0%            325553c6-3e05-41f6-a1f7-47436743816f  rack1
UN  [node2 IP]  7.76 MB    256     100.0%            05bdb1d4-c39b-48f1-8248-911d61935925  rack1
UN  [node1 IP]  416.13 MB  256     100.0%            a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1

If I restart C* on node2 while C* is running on node1 and node3 (i.e.
skipping steps 2) and 3)), a runtime exception occurs:
RuntimeException: "A node with address [node2 IP] already exists,
cancelling join..."

I'm not sure whether this rejoin itself causes the data loss; all data can
be read properly just after the rejoin.
But some rows are lost when I kill & restart C* in destructive tests after
this rejoin.

Thanks.

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
Thanks Ben,

I've reported this issue at
https://issues.apache.org/jira/browse/CASSANDRA-12955.

I'll tell you if I find anything about the data loss issue.

Regards,
yuji


On Thu, Nov 24, 2016 at 1:37 PM, Ben Slater <be...@instaclustr.com>
wrote:

> You could certainly log a JIRA for the “failure node rejoin” issue
> (https://issues.apache.org/jira/browse/cassandra). It sounds like
> unexpected behaviour to me. However, I’m not sure it will be viewed a high
> priority to fix given there is a clear operational work-around.
>
> Cheers
> Ben
>
> On Thu, 24 Nov 2016 at 15:14 Yuji Ito <yu...@imagine-orb.com> wrote:
>
>> Hi Ben,
>>
>> I am continuing to investigate the data loss issue.
>> I'm examining logs and source code and trying to reproduce the data loss
>> issue with a simple test.
>> I'm also trying my destructive test with DROP instead of TRUNCATE.
>>
>> BTW, I want to discuss the "failure node rejoin" issue in the title again.
>>
>> Will this issue be fixed? Other nodes should refuse this unexpected
>> rejoin.
>> Or should I be more careful when adding failed nodes back to the existing
>> cluster?
>>
>> Thanks,
>> yuji
>>
>>
>> On Fri, Nov 11, 2016 at 1:00 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> From a quick look I couldn’t find any defects other than the ones you’ve
>> found that seem potentially relevant to your issue (if any one else on the
>> list knows of one please chime in). Maybe the next step, if you haven’t
>> done so already, is to check your Cassandra logs for any signs of issues
>> (ie WARNING or ERROR logs) in the failing case.
>>
>> Cheers
>> Ben
>>
>> On Fri, 11 Nov 2016 at 13:07 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> I tried 2.2.8 and could reproduce the problem.
>> So, I'm investigating some bug fixes of repair and commitlog between
>> 2.2.8 and 3.0.9.
>>
>> - CASSANDRA-12508: "nodetool repair returns status code 0 for some errors"
>>
>> - CASSANDRA-12436: "Under some races commit log may incorrectly think it
>> has unflushed data"
>>   - related to CASSANDRA-9669, CASSANDRA-11828 (the fix of 2.2 is
>> different from that of 3.0?)
>>
>> Do you know other bug fixes related to commitlog?
>>
>> Regards
>> yuji
>>
>> On Wed, Nov 9, 2016 at 11:34 AM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> There have been a few commit log bugs around in the last couple of months
>> so perhaps you’ve hit something that was fixed recently. Would be
>> interesting to know the problem is still occurring in 2.2.8.
>>
>> I suspect what is happening is that when you do your initial read
>> (without flush) to check the number of rows, the data is in memtables and
>> theoretically the commitlogs but not sstables. With the forced stop the
>> memtables are lost and Cassandra should read the commitlog from disk at
>> startup to reconstruct the memtables. However, it looks like that didn’t
>> happen for some (bad) reason.
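Ben's description of the write path can be illustrated with a toy model (not Cassandra code, just a sketch of the memtable/commitlog/sstable relationship): a forced stop loses the memtable, and only commitlog replay at startup can bring unflushed rows back.

```python
# Toy model of the write path described above: a write lands in both the
# memtable and the commitlog; "flush" persists it to sstables; a forced
# stop drops the memtable; normal startup replays the commitlog.
class ToyNode:
    def __init__(self):
        self.memtable, self.commitlog, self.sstables = {}, {}, {}

    def write(self, key, value):
        self.memtable[key] = value      # in-memory
        self.commitlog[key] = value     # durable log of unflushed writes

    def flush(self):                    # what "nodetool flush" forces
        self.sstables.update(self.memtable)
        self.memtable.clear()
        self.commitlog.clear()          # flushed data no longer needs the log

    def crash_and_restart(self, replay=True):
        self.memtable.clear()           # memtables are volatile
        if replay:                      # normal startup replays the commitlog
            self.memtable.update(self.commitlog)

    def read_all(self):
        return {**self.sstables, **self.memtable}
```

With `replay=False` (the suspected failure mode in this thread) every unflushed row disappears, which is exactly the symptom being described.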
>>
>> Good news that 3.0.9 fixes the problem so up to you if you want to
>> investigate further and see if you can narrow it down to file a JIRA
>> (although the first step of that would be trying 2.2.9 to make sure it’s
>> not already fixed there).
>>
>> Cheers
>> Ben
>>
>> On Wed, 9 Nov 2016 at 12:56 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> I tried C* 3.0.9 instead of 2.2.
>> The data loss problem hasn't happened so far (without `nodetool flush`).
>>
>> Thanks
>>
>> On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> When I added `nodetool flush` on all nodes after step 2, the problem
>> didn't happen.
>> Did replay from old commit logs delete rows?
>>
>> Perhaps the flush operation just detected that some nodes were down in
>> step 2 (just after truncating tables).
>> (Insertion and the check in step 2 would succeed even if one node was down
>> because the consistency level was SERIAL.
>> If the flush failed on more than one node, the test would retry step 2.)
>> However, if that were the case, the problem should also happen without
>> deleting the Cassandra data.
>>
>> Regards,
>> yuji
>>
>>
>> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> Definitely sounds to me like something is not working as expected but I
>> don’t really have any idea what would cause that (other than the fairly
>> extreme failure scenario). A couple of things I can think of to try to
>> narrow it down:
>> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
>> data is written to sstables rather than relying on commit logs
>> 2) Run the test with consistency level QUORUM rather than SERIAL
>> (shouldn’t be any different but QUORUM is more widely used so maybe there
>> is a bug that’s specific to SERIAL)
>>
>> Cheers
>> Ben
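Suggestion 1) above could be scripted along these lines; the node list and ssh access are assumptions, and with DRY_RUN=1 (the default here) the commands are only printed:

```shell
#!/bin/sh
# Run "nodetool flush" on every node so all memtable data is written to
# sstables and the test no longer depends on commitlog replay at restart.
# NODES and ssh access are assumptions; DRY_RUN=1 (default) only prints.
NODES="node1 node2 node3"

flush_all() {
  for n in $NODES; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "ssh $n nodetool flush"
    else
      ssh "$n" nodetool flush || return 1   # caller should retry on failure
    fi
  done
}

flush_all
```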
>>
>> On Mon, 24 Oct 2016 at 10:29 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Hi Ben,
>>
>> The test without killing nodes has been working well with no data loss.
>> I've repeated the test about 200 times after removing data and
>> rebuilding/repairing.
>>
>> Regards,
>>
>>
>> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> > Just to confirm, are you saying:
>> > a) after operation 2, you select all and get 1000 rows
>> > b) after operation 3 (which only does updates and read) you select and
>> only get 953 rows?
>>
>> That's right!
>>
>> I've started the test without killing nodes.
>> I'll report the result to you next Monday.
>>
>> Thanks
>>
>>
>> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> Just to confirm, are you saying:
>> a) after operation 2, you select all and get 1000 rows
>> b) after operation 3 (which only does updates and read) you select and
>> only get 953 rows?
>>
>> If so, that would be very unexpected. If you run your tests without
>> killing nodes do you get the expected (1,000) rows?
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> > Are you certain your tests don’t generate any overlapping inserts (by
>> PK)?
>>
>> Yes. The operation 2) also checks the number of rows just after all
>> insertions.
>>
>>
>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK. Are you certain your tests don’t generate any overlapping inserts (by
>> PK)? Cassandra basically treats any inserts with the same primary key as
>> updates (so 1000 insert operations may not necessarily result in 1000 rows
>> in the DB).
>>
>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> thanks Ben,
>>
>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>> the mismatch between actual and expected) - at that end of operation (2) or
>> after operation (3)?
>>
>> after operation 3), at operation 4) which reads all rows by cqlsh with
>> CL.SERIAL
>>
>> > 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>> - create keyspace testkeyspace WITH REPLICATION =
>> {'class':'SimpleStrategy','replication_factor':3};
>> - consistency level is SERIAL
>>
>>
>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>>
>> A couple of questions:
>> 1) At what stage did you have (or expect to have) 1000 rows (and have the
>> mismatch between actual and expected) - at that end of operation (2) or
>> after operation (3)?
>> 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>>
>> Cheers
>> Ben
>>
>> ...
>
> [Message clipped]

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
You could certainly log a JIRA for the “failure node rejoin” issue
(https://issues.apache.org/jira/browse/cassandra). It sounds like
unexpected behaviour to me. However, I’m not sure it will be viewed a high
priority to fix given there is a clear operational work-around.

Cheers
Ben
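The operational work-around referred to here — rejoining the wiped node via `-Dcassandra.replace_address_first_boot`, as described elsewhere in the thread — might look roughly like this. Everything except the JVM flag itself (the `append_jvm_option` helper, service name, IP) is hypothetical; DRY_RUN=1 (the default here) only prints the commands:

```shell
#!/bin/sh
# Sketch: bring a wiped node back as a replacement for its own old
# address, then repair. append_jvm_option is a hypothetical helper; the
# flag -Dcassandra.replace_address_first_boot is the real JVM option.
DEAD_NODE_IP="10.0.0.2"   # hypothetical address of the wiped node

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# 1) set the replace flag for the node's first boot after the wipe
run append_jvm_option "-Dcassandra.replace_address_first_boot=$DEAD_NODE_IP"
# 2) start Cassandra; the node bootstraps as a replacement, streaming data
run sudo service cassandra start
# 3) once the node shows UN in "nodetool status", remove the flag and repair
run nodetool repair
```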

> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> I tried to run a rebuild and repair after the failure node rejoined the
> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
> The failed node could rejoin and I could read all rows successfully.
> (Sometimes a repair failed because the node could not access another node;
> if it failed, I retried the repair.)
>
> But some rows were lost after the destructive test was repeated (for about
> 5-6 hours).
> After the test inserted 1000 rows, there were only 953 rows at the end of
> the test.
>
>

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
Hi Ben,

I am continuing to investigate the data loss issue.
I'm examining logs and source code and trying to reproduce the data loss
issue with a simple test.
I'm also trying my destructive test with DROP instead of TRUNCATE.

BTW, I want to discuss the "failure node rejoin" issue in the title again.

Will this issue be fixed? Other nodes should refuse this unexpected rejoin.
Or should I be more careful when adding failed nodes back to the existing
cluster?

Thanks,
yuji


>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> I tried to run a rebuild and repair after the failure node rejoined the
>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>> The failed node could rejoin and I could read all rows successfully.
>> (Sometimes a repair failed because the node could not access another node;
>> if it failed, I retried the repair.)
>>
>> But some rows were lost after the destructive test was repeated (for about
>> 5-6 hours).
>> After the test inserted 1000 rows, there were only 953 rows at the end of
>> the test.
>>
>> My destructive test:
>> - each C* node is killed & restarted at random intervals (within about
>> 5 min) throughout this test
>> 1) truncate all tables
>> 2) insert initial rows (check that all rows are inserted successfully)
>> 3) request a lot of reads/writes to random rows for about 30 min
>> 4) check all rows
>> If operation 1), 2) or 4) fails due to a C* failure, the test retries the
>> operation.
>>
>> Does anyone have a similar problem?
>> What causes the data loss?
>> Does the test need any extra operation when a C* node is restarted?
>> (Currently, I just restart the C* process.)
>>
>> Regards,
>>
>>
>> ...
>
> [Message clipped]
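The retry behaviour the destructive test relies on (retrying the truncate/insert/check steps when a killed node interrupts them) can be factored into a small helper. This is a generic sketch, not the author's actual test harness:

```python
import time

def retry(operation, is_transient, attempts=5, delay=1.0):
    """Run operation(); if it raises an error that is_transient() classifies
    as a node failure, wait `delay` seconds and try again, up to `attempts`
    tries. Non-transient errors and the final failure propagate."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

A real harness would pass a closure wrapping the driver call (e.g. `session.execute(...)`) as `operation`, and an `is_transient` predicate that matches the driver's node-unavailable errors.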

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
From a quick look I couldn’t find any defects other than the ones you’ve
found that seem potentially relevant to your issue (if any one else on the
list knows of one please chime in). Maybe the next step, if you haven’t
done so already, is to check your Cassandra logs for any signs of issues
(ie WARNING or ERROR logs) in the failing case.

Cheers
Ben
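A quick way to do that check (the log path is the common packaged default and may differ per installation):

```shell
#!/bin/sh
# Scan a Cassandra system log for WARN/ERROR entries, as suggested above.
# /var/log/cassandra/system.log is a common default location (an
# assumption here); override with LOG=... if yours differs.
LOG="${LOG:-/var/log/cassandra/system.log}"

scan_log() {
  grep -E 'WARN|ERROR' "$1" || echo "no WARN/ERROR entries found in $1"
}

if [ -r "$LOG" ]; then scan_log "$LOG"; fi
```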

> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, that’s a bit more unexpected (to me at least) but I think the solution
> of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.
>
>
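The retry structure of the destructive test described above can be sketched as a small harness. The step functions here are hypothetical stubs standing in for real Cassandra operations (truncate, insert, check); a step signals a C* failure by raising an exception, and the harness retries it, which is the shape of the test, not its real driver code.

```python
# Sketch of the destructive-test loop: operations 1), 2) and 4) are
# retried when they fail due to a node being down mid-operation.
def run_with_retry(step, max_attempts=5):
    for _ in range(max_attempts):
        try:
            return step()
        except RuntimeError:  # e.g. a C* node was killed mid-operation
            continue
    raise AssertionError("step kept failing")

def make_flaky(fail_times, result=None):
    """Build a stub step that fails `fail_times` times, then succeeds."""
    state = {"left": fail_times}
    def step():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("C* failure")
        return result
    return step

# 1) truncate, 2) insert + verify, 4) final check; each retried on failure.
truncate_all   = make_flaky(1)
insert_initial = make_flaky(2, result=1000)
check_all_rows = make_flaky(0, result=1000)

run_with_retry(truncate_all)
inserted = run_with_retry(insert_initial)
final    = run_with_retry(check_all_rows)
print(inserted, final)
```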

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
Thanks Ben,

I tried 2.2.8 and could reproduce the problem.
So I'm investigating bug fixes related to repair and the commit log between
2.2.8 and 3.0.9.

- CASSANDRA-12508: "nodetool repair returns status code 0 for some errors"

- CASSANDRA-12436: "Under some races commit log may incorrectly think it
has unflushed data"
  - related to CASSANDRA-9669 and CASSANDRA-11828 (is the fix in 2.2
different from the one in 3.0?)

Do you know of other bug fixes related to the commit log?

Regards
yuji

On Wed, Nov 9, 2016 at 11:34 AM, Ben Slater <be...@instaclustr.com>
wrote:

> There have been a few commit log bugs around in the last couple of months
> so perhaps you’ve hit something that was fixed recently. It would be
> interesting to know whether the problem is still occurring in 2.2.8.
>
> I suspect what is happening is that when you do your initial read (without
> flush) to check the number of rows, the data is in memtables and
> theoretically the commitlogs but not sstables. With the forced stop the
> memtables are lost and Cassandra should read the commitlog from disk at
> startup to reconstruct the memtables. However, it looks like that didn’t
> happen for some (bad) reason.
>
> Good news that 3.0.9 fixes the problem so up to you if you want to
> investigate further and see if you can narrow it down to file a JIRA
> (although the first step of that would be trying 2.2.9 to make sure it’s
> not already fixed there).
>
> Cheers
> Ben
>
> On Wed, 9 Nov 2016 at 12:56 Yuji Ito <yu...@imagine-orb.com> wrote:
>
>> I tried C* 3.0.9 instead of 2.2.
>> The data loss problem hasn't happened so far (without `nodetool flush`).
>>
>> Thanks
>>
>> On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> When I added `nodetool flush` on all nodes after step 2, the problem
>> didn't happen.
>> Did replaying old commit logs delete rows?
>>
>> Perhaps the flush operation just detected that some nodes were down in
>> step 2 (just after truncating tables).
>> (The insertion and check in step 2 would succeed even if one node was down
>> because the consistency level was SERIAL.
>> If the flush failed on more than one node, the test would retry step 2.)
>> However, if so, the problem would also happen without deleting the
>> Cassandra data.
>>
>> Regards,
>> yuji
>>
>>
>> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> Definitely sounds to me like something is not working as expected but I
>> don’t really have any idea what would cause that (other than the fairly
>> extreme failure scenario). A couple of things I can think of to try to
>> narrow it down:
>> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
>> data is written to sstables rather than relying on commit logs
>> 2) Run the test with consistency level quorum rather than serial
>> (shouldn’t be any different but quorum is more widely used so maybe there
>> is a bug that’s specific to serial)
>>
>> Cheers
>> Ben
>>
>> On Mon, 24 Oct 2016 at 10:29 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Hi Ben,
>>
>> The test without killing nodes has been working well without data loss.
>> I've repeated my test about 200 times after removing data and
>> rebuild/repair.
>>
>> Regards,
>>
>>
>> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> > Just to confirm, are you saying:
>> > a) after operation 2, you select all and get 1000 rows
>> > b) after operation 3 (which only does updates and read) you select and
>> only get 953 rows?
>>
>> That's right!
>>
>> I've started the test without killing nodes.
>> I'll report the result to you next Monday.
>>
>> Thanks
>>
>>
>> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> Just to confirm, are you saying:
>> a) after operation 2, you select all and get 1000 rows
>> b) after operation 3 (which only does updates and read) you select and
>> only get 953 rows?
>>
>> If so, that would be very unexpected. If you run your tests without
>> killing nodes do you get the expected (1,000) rows?
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> > Are you certain your tests don’t generate any overlapping inserts (by
>> PK)?
>>
>> Yes. The operation 2) also checks the number of rows just after all
>> insertions.
>>
>>
>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK. Are you certain your tests don’t generate any overlapping inserts (by
>> PK)? Cassandra basically treats any inserts with the same primary key as
>> updates (so 1000 insert operations may not necessarily result in 1000 rows
>> in the DB).
>>
>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> thanks Ben,
>>
>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>> the mismatch between actual and expected) - at the end of operation (2) or
>> after operation (3)?
>>
>> after operation 3), at operation 4) which reads all rows by cqlsh with
>> CL.SERIAL
>>
>> > 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>> - create keyspace testkeyspace WITH REPLICATION =
>> {'class':'SimpleStrategy','replication_factor':3};
>> - consistency level is SERIAL
>>
>>
>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>>
>> A couple of questions:
>> 1) At what stage did you have (or expect to have) 1000 rows (and have the
>> mismatch between actual and expected) - at the end of operation (2) or
>> after operation (3)?
>> 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> I tried to run a rebuild and repair after the failure node rejoined the
>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>> The failure node could rejoin and I could read all rows successfully.
>> (Sometimes a repair failed because the node could not access another node.
>> If it failed, I retried the repair.)
>>
>> But some rows were lost after the destructive test had run repeatedly for
>> about 5-6 hours.
>> Although the test inserted 1000 rows, only 953 rows remained at the end of
>> the test.
>>
>> My destructive test:
>> - each C* node is killed & restarted at a random interval (within about
>> 5 min) throughout this test
>> 1) truncate all tables
>> 2) insert initial rows (check that all rows are inserted successfully)
>> 3) request a lot of reads/writes to random rows for about 30 min
>> 4) check all rows
>> If operation 1), 2) or 4) fails due to a C* failure, the test retries the
>> operation.
>>
>> Does anyone have a similar problem?
>> What causes the data loss?
>> Does the test need any extra operation when a C* node is restarted?
>> (Currently, I just restart the C* process.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK, that’s a bit more unexpected (to me at least) but I think the
>> solution of running a rebuild or repair still applies.
>>
>> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben, Jeff
>>
>> Sorry that my explanation confused you.
>>
>> Only node1 is the seed node.
>> Node2 whose C* data is deleted is NOT a seed.
>>
>> I restarted the failure node (node2) after restarting the seed node (node1).
>> The restarting node2 succeeded without the exception.
>> (I couldn't restart node2 before restarting node1 as expected.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
>> wrote:
>>
>> ...
>
> [Message clipped]

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
There have been a few commit log bugs around in the last couple of months
so perhaps you’ve hit something that was fixed recently. It would be
interesting to know whether the problem is still occurring in 2.2.8.

I suspect what is happening is that when you do your initial read (without
flush) to check the number of rows, the data is in memtables and
theoretically the commitlogs but not sstables. With the forced stop the
memtables are lost and Cassandra should read the commitlog from disk at
startup to reconstruct the memtables. However, it looks like that didn’t
happen for some (bad) reason.
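The memtable/commitlog interaction described above can be sketched as a toy write-ahead log: writes go to an in-memory table and are appended to a log file before being acknowledged; after a forced stop the memtable is gone, and startup must rebuild it by replaying the log. This is an illustrative Python sketch of the general mechanism, not Cassandra's actual implementation.

```python
import json
import os
import tempfile

# Toy write-ahead log: each write is appended to the log on disk and
# applied to the in-memory table (the "memtable").
def write(log_path, memtable, key, value):
    with open(log_path, "a") as log:
        log.write(json.dumps([key, value]) + "\n")
    memtable[key] = value

# Startup path: reconstruct the memtable by replaying the commitlog.
def replay(log_path):
    memtable = {}
    if os.path.exists(log_path):
        with open(log_path) as log:
            for line in log:
                key, value = json.loads(line)
                memtable[key] = value
    return memtable

log_path = os.path.join(tempfile.mkdtemp(), "commitlog")
memtable = {}
for i in range(5):
    write(log_path, memtable, f"k{i}", i)

memtable = None               # forced stop: memtable contents are lost
recovered = replay(log_path)  # restart: replay the commitlog from disk
print(sorted(recovered.items()))
```

If replay is skipped or broken (the suspected bug above), `recovered` would be missing the rows that were only in the memtable at the time of the kill.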

Good news that 3.0.9 fixes the problem so up to you if you want to
investigate further and see if you can narrow it down to file a JIRA
(although the first step of that would be trying 2.2.9 to make sure it’s
not already fixed there).

Cheers
Ben

On Wed, 9 Nov 2016 at 12:56 Yuji Ito <yu...@imagine-orb.com> wrote:

> I tried C* 3.0.9 instead of 2.2.
> The data loss problem hasn't happened so far (without `nodetool flush`).
>
> Thanks
>
> On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> When I added `nodetool flush` on all nodes after step 2, the problem
> didn't happen.
> Did replaying old commit logs delete rows?
>
> Perhaps the flush operation just detected that some nodes were down in
> step 2 (just after truncating tables).
> (The insertion and check in step 2 would succeed even if one node was down
> because the consistency level was SERIAL.
> If the flush failed on more than one node, the test would retry step 2.)
> However, if so, the problem would also happen without deleting the
> Cassandra data.
>
> Regards,
> yuji
>
>
> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> Definitely sounds to me like something is not working as expected but I
> don’t really have any idea what would cause that (other than the fairly
> extreme failure scenario). A couple of things I can think of to try to
> narrow it down:
> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
> data is written to sstables rather than relying on commit logs
> 2) Run the test with consistency level quorum rather than serial
> (shouldn’t be any different but quorum is more widely used so maybe there
> is a bug that’s specific to serial)
>
> Cheers
> Ben
>
> On Mon, 24 Oct 2016 at 10:29 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Hi Ben,
>
> The test without killing nodes has been working well without data loss.
> I've repeated my test about 200 times after removing data and
> rebuild/repair.
>
> Regards,
>
>
> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>
> > Just to confirm, are you saying:
> > a) after operation 2, you select all and get 1000 rows
> > b) after operation 3 (which only does updates and read) you select and
> only get 953 rows?
>
> That's right!
>
> I've started the test without killing nodes.
> I'll report the result to you next Monday.
>
> Thanks
>
>
> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> Just to confirm, are you saying:
> a) after operation 2, you select all and get 1000 rows
> b) after operation 3 (which only does updates and read) you select and
> only get 953 rows?
>
> If so, that would be very unexpected. If you run your tests without
> killing nodes do you get the expected (1,000) rows?
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> > Are you certain your tests don’t generate any overlapping inserts (by
> PK)?
>
> Yes. The operation 2) also checks the number of rows just after all
> insertions.
>
>
> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK. Are you certain your tests don’t generate any overlapping inserts (by
> PK)? Cassandra basically treats any inserts with the same primary key as
> updates (so 1000 insert operations may not necessarily result in 1000 rows
> in the DB).
>
> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> thanks Ben,
>
> > 1) At what stage did you have (or expect to have) 1000 rows (and have
> the mismatch between actual and expected) - at the end of operation (2) or
> after operation (3)?
>
> after operation 3), at operation 4) which reads all rows by cqlsh with
> CL.SERIAL
>
> > 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
> - create keyspace testkeyspace WITH REPLICATION =
> {'class':'SimpleStrategy','replication_factor':3};
> - consistency level is SERIAL
>
>
> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
>
> A couple of questions:
> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at the end of operation (2) or
> after operation (3)?
> 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> I tried to run a rebuild and repair after the failure node rejoined the
> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
> The failure node could rejoin and I could read all rows successfully.
> (Sometimes a repair failed because the node could not access another node.
> If it failed, I retried the repair.)
>
> But some rows were lost after the destructive test had run repeatedly for
> about 5-6 hours.
> Although the test inserted 1000 rows, only 953 rows remained at the end of
> the test.
>
> My destructive test:
> - each C* node is killed & restarted at a random interval (within about
> 5 min) throughout this test
> 1) truncate all tables
> 2) insert initial rows (check that all rows are inserted successfully)
> 3) request a lot of reads/writes to random rows for about 30 min
> 4) check all rows
> If operation 1), 2) or 4) fails due to a C* failure, the test retries the
> operation.
>
> Does anyone have a similar problem?
> What causes the data loss?
> Does the test need any extra operation when a C* node is restarted?
> (Currently, I just restart the C* process.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, that’s a bit more unexpected (to me at least) but I think the solution
> of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.
>
> Only node1 is the seed node.
> Node2 whose C* data is deleted is NOT a seed.
>
> I restarted the failure node (node2) after restarting the seed node (node1).
> The restarting node2 succeeded without the exception.
> (I couldn't restart node2 before restarting node1 as expected.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
> The unstated "problem" here is that node1 is a seed, which implies
> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
> setup to start without bootstrapping).
>
> That means once the data dir is wiped, it's going to start again without a
> bootstrap, and make a single node cluster or join an existing cluster if
> the seed list is valid
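Jeff's explanation above reduces to a startup decision: a node that finds its own address in the seed list never bootstraps, even with an empty data directory. A rough Python sketch of that rule follows; the real logic in Cassandra also consults the `auto_bootstrap` yaml option and the system tables, so this is a deliberate simplification, and the addresses are made up.

```python
# Simplified model of the bootstrap decision: a seed never bootstraps,
# so a wiped seed starts up (and can form or join a cluster) without
# streaming its data back from the other replicas.
def should_bootstrap(node_address, seed_list, auto_bootstrap=True):
    if node_address in seed_list:
        return False  # seeds cannot bootstrap
    return auto_bootstrap

seeds = ["10.0.0.1"]
print(should_bootstrap("10.0.0.1", seeds))  # node1, the seed
print(should_bootstrap("10.0.0.2", seeds))  # node2, a non-seed
```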
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
>
>

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
I tried C* 3.0.9 instead of 2.2.
The data loss problem hasn't happened so far (without `nodetool flush`).

Thanks

On Fri, Nov 4, 2016 at 3:50 PM, Yuji Ito <yu...@imagine-orb.com> wrote:

> Thanks Ben,
>
> When I added `nodetool flush` on all nodes after step 2, the problem
> didn't happen.
> Did replaying old commit logs delete rows?
>
> Perhaps the flush operation just detected that some nodes were down in
> step 2 (just after truncating tables).
> (The insertion and check in step 2 would succeed even if one node was down
> because the consistency level was SERIAL.
> If the flush failed on more than one node, the test would retry step 2.)
> However, if so, the problem would also happen without deleting the
> Cassandra data.
>
> Regards,
> yuji
>
>
> On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
>> Definitely sounds to me like something is not working as expected but I
>> don’t really have any idea what would cause that (other than the fairly
>> extreme failure scenario). A couple of things I can think of to try to
>> narrow it down:
>> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
>> data is written to sstables rather than relying on commit logs
>> 2) Run the test with consistency level quorum rather than serial
>> (shouldn’t be any different but quorum is more widely used so maybe there
>> is a bug that’s specific to serial)
>>
>> Cheers
>> Ben
>>
>> On Mon, 24 Oct 2016 at 10:29 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>>> Hi Ben,
>>>
>>> The test without killing nodes has been working well without data loss.
>>> I've repeated my test about 200 times after removing data and
>>> rebuild/repair.
>>>
>>> Regards,
>>>
>>>
>>> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> > Just to confirm, are you saying:
>>> > a) after operation 2, you select all and get 1000 rows
>>> > b) after operation 3 (which only does updates and read) you select and
>>> only get 953 rows?
>>>
>>> That's right!
>>>
>>> I've started the test without killing nodes.
>>> I'll report the result to you next Monday.
>>>
>>> Thanks
>>>
>>>
>>> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <be...@instaclustr.com>
>>> wrote:
>>>
>>> Just to confirm, are you saying:
>>> a) after operation 2, you select all and get 1000 rows
>>> b) after operation 3 (which only does updates and read) you select and
>>> only get 953 rows?
>>>
>>> If so, that would be very unexpected. If you run your tests without
>>> killing nodes do you get the expected (1,000) rows?
>>>
>>> Cheers
>>> Ben
>>>
>>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> > Are you certain your tests don’t generate any overlapping inserts (by
>>> PK)?
>>>
>>> Yes. The operation 2) also checks the number of rows just after all
>>> insertions.
>>>
>>>
>>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
>>> wrote:
>>>
>>> OK. Are you certain your tests don’t generate any overlapping inserts
>>> (by PK)? Cassandra basically treats any inserts with the same primary key
>>> as updates (so 1000 insert operations may not necessarily result in 1000
>>> rows in the DB).
>>>
>>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> thanks Ben,
>>>
>>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>>> the mismatch between actual and expected) - at the end of operation (2) or
>>> after operation (3)?
>>>
>>> after operation 3), at operation 4) which reads all rows by cqlsh with
>>> CL.SERIAL
>>>
>>> > 2) What replication factor and replication strategy is used by the
>>> test keyspace? What consistency level is used by your operations?
>>>
>>> - create keyspace testkeyspace WITH REPLICATION =
>>> {'class':'SimpleStrategy','replication_factor':3};
>>> - consistency level is SERIAL
>>>
>>>
>>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <ben.slater@instaclustr.com
>>> > wrote:
>>>
>>>
>>> A couple of questions:
>>> 1) At what stage did you have (or expect to have) 1000 rows (and have
>>> the mismatch between actual and expected) - at the end of operation (2) or
>>> after operation (3)?
>>> 2) What replication factor and replication strategy is used by the test
>>> keyspace? What consistency level is used by your operations?
>>>
>>>
>>> Cheers
>>> Ben
>>>
>>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> Thanks Ben,
>>>
>>> I tried to run a rebuild and repair after the failure node rejoined the
>>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>>> The failure node could rejoin and I could read all rows successfully.
>>> (Sometimes a repair failed because the node could not access another
>>> node. If it failed, I retried the repair.)
>>>
>>> But some rows were lost after the destructive test had run repeatedly
>>> for about 5-6 hours.
>>> Although the test inserted 1000 rows, only 953 rows remained at the end
>>> of the test.
>>>
>>> My destructive test:
>>> - each C* node is killed & restarted at a random interval (within
>>> about 5 min) throughout this test
>>> 1) truncate all tables
>>> 2) insert initial rows (check that all rows are inserted successfully)
>>> 3) request a lot of reads/writes to random rows for about 30 min
>>> 4) check all rows
>>> If operation 1), 2) or 4) fails due to a C* failure, the test retries
>>> the operation.
>>>
>>> Does anyone have a similar problem?
>>> What causes the data loss?
>>> Does the test need any extra operation when a C* node is restarted?
>>> (Currently, I just restart the C* process.)
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
>>> wrote:
>>>
>>> OK, that’s a bit more unexpected (to me at least) but I think the
>>> solution of running a rebuild or repair still applies.
>>>
>>> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> Thanks Ben, Jeff
>>>
>>> Sorry that my explanation confused you.
>>>
>>> Only node1 is the seed node.
>>> Node2 whose C* data is deleted is NOT a seed.
>>>
>>> I restarted the failure node (node2) after restarting the seed
>>> node (node1).
>>> The restarting node2 succeeded without the exception.
>>> (I couldn't restart node2 before restarting node1 as expected.)
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
>>> wrote:
>>>
>>> The unstated "problem" here is that node1 is a seed, which implies
>>> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
>>> setup to start without bootstrapping).
>>>
>>> That means once the data dir is wiped, it's going to start again without
>>> a bootstrap, and make a single node cluster or join an existing cluster if
>>> the seed list is valid
>>>
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
>>> wrote:
>>>
>>> OK, sorry - I think understand what you are asking now.
>>>
>>> However, I’m still a little confused by your description. I think your
>>> scenario is:
>>> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
>>> 2) Delete all data from Node A
>>> 3) Restart Node A
>>> 4) Restart Node B,C
>>>
>>> Is this correct?
>>>
>>> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node
>>> A starts successfully as there are no running nodes to tell it via gossip
>>> that it shouldn’t start up without the “replaces” flag.
>>>
>>> I think the right way to recover in this scenario is to run a nodetool
>>> rebuild on Node A after the other two nodes are running. You could
>>> theoretically also run a repair (which would be good practice after a weird
>>> failure scenario like this) but rebuild will probably be quicker given you
>>> know all the data needs to be re-streamed.
>>>
>>> ...
>
> [Message clipped]

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
Thanks Ben,

When I added `nodetool flush` on all nodes after step 2, the problem didn't
happen.
Did replaying old commit logs delete rows?

Perhaps the flush operation just detected that some nodes were down in
step 2 (just after truncating tables).
(The insertion and check in step 2 would succeed even if one node was down
because the consistency level was SERIAL.
If the flush failed on more than one node, the test would retry step 2.)
However, if so, the problem would also happen without deleting the
Cassandra data.

Regards,
yuji


On Mon, Oct 24, 2016 at 8:37 AM, Ben Slater <be...@instaclustr.com>
wrote:

> Definitely sounds to me like something is not working as expected but I
> don’t really have any idea what would cause that (other than the fairly
> extreme failure scenario). A couple of things I can think of to try to
> narrow it down:
> 1) Run nodetool flush on all nodes after step 2 - that will make sure all
> data is written to sstables rather than relying on commit logs
> 2) Run the test with consistency level quorum rather than serial
> (shouldn’t be any different but quorum is more widely used so maybe there
> is a bug that’s specific to serial)
>
> Cheers
> Ben
>
> On Mon, 24 Oct 2016 at 10:29 Yuji Ito <yu...@imagine-orb.com> wrote:
>
>> Hi Ben,
>>
>> The test without killing nodes has been working well without data loss.
>> I've repeated my test about 200 times after removing data and
>> rebuild/repair.
>>
>> Regards,
>>
>>
>> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> > Just to confirm, are you saying:
>> > a) after operation 2, you select all and get 1000 rows
>> > b) after operation 3 (which only does updates and read) you select and
>> only get 953 rows?
>>
>> That's right!
>>
>> I've started the test without killing nodes.
>> I'll report the result to you next Monday.
>>
>> Thanks
>>
>>
>> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> Just to confirm, are you saying:
>> a) after operation 2, you select all and get 1000 rows
>> b) after operation 3 (which only does updates and read) you select and
>> only get 953 rows?
>>
>> If so, that would be very unexpected. If you run your tests without
>> killing nodes do you get the expected (1,000) rows?
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> > Are you certain your tests don’t generate any overlapping inserts (by
>> PK)?
>>
>> Yes. The operation 2) also checks the number of rows just after all
>> insertions.
>>
>>
>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK. Are you certain your tests don’t generate any overlapping inserts (by
>> PK)? Cassandra basically treats any inserts with the same primary key as
>> updates (so 1000 insert operations may not necessarily result in 1000 rows
>> in the DB).
>>
>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> thanks Ben,
>>
>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>> the mismatch between actual and expected) - at the end of operation (2) or
>> after operation (3)?
>>
>> after operation 3), at operation 4) which reads all rows by cqlsh with
>> CL.SERIAL
>>
>> > 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>> - create keyspace testkeyspace WITH REPLICATION =
>> {'class':'SimpleStrategy','replication_factor':3};
>> - consistency level is SERIAL
>>
>>
>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>>
>> A couple of questions:
>> 1) At what stage did you have (or expect to have) 1000 rows (and have the
>> mismatch between actual and expected) - at the end of operation (2) or
>> after operation (3)?
>> 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> I tried to run a rebuild and repair after the failure node rejoined the
>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>> The failure node could rejoin and I could read all rows successfully.
>> (Sometimes a repair failed because the node could not access another node.
>> If it failed, I retried the repair.)
>>
>> But some rows were lost after the destructive test had run repeatedly for
>> about 5-6 hours.
>> Although the test inserted 1000 rows, only 953 rows remained at the end of
>> the test.
>>
>> My destructive test:
>> - each C* node is killed & restarted at a random interval (within about
>> 5 min) throughout this test
>> 1) truncate all tables
>> 2) insert initial rows (check that all rows are inserted successfully)
>> 3) request a lot of reads/writes to random rows for about 30 min
>> 4) check all rows
>> If operation 1), 2) or 4) fails due to a C* failure, the test retries the
>> operation.
>>
>> Does anyone have a similar problem?
>> What causes the data loss?
>> Does the test need any extra operation when a C* node is restarted?
>> (Currently, I just restart the C* process.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK, that’s a bit more unexpected (to me at least) but I think the
>> solution of running a rebuild or repair still applies.
>>
>> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben, Jeff
>>
>> Sorry that my explanation confused you.
>>
>> Only node1 is the seed node.
>> Node2 whose C* data is deleted is NOT a seed.
>>
>> I restarted the failure node (node2) after restarting the seed node (node1).
>> The restarting node2 succeeded without the exception.
>> (I couldn't restart node2 before restarting node1 as expected.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
>> wrote:
>>
>> The unstated "problem" here is that node1 is a seed, which implies
>> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
>> setup to start without bootstrapping).
>>
>> That means once the data dir is wiped, it's going to start again without
>> a bootstrap, and make a single node cluster or join an existing cluster if
>> the seed list is valid
>>
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK, sorry - I think understand what you are asking now.
>>
>> However, I’m still a little confused by your description. I think your
>> scenario is:
>> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
>> 2) Delete all data from Node A
>> 3) Restart Node A
>> 4) Restart Node B,C
>>
>> Is this correct?
>>
>> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node
>> A starts successfully as there are no running nodes to tell it via gossip
>> that it shouldn’t start up without the “replaces” flag.
>>
>> I think the right way to recover in this scenario is to run a nodetool
>> rebuild on Node A after the other two nodes are running. You could
>> theoretically also run a repair (which would be good practice after a weird
>> failure scenario like this) but rebuild will probably be quicker given you
>> know all the data needs to be re-streamed.
>>
>> Cheers
>> Ben
>>
>> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thank you Ben, Yabin
>>
>> I understood the rejoin was illegal.
>> I expected this rejoin would fail with the exception.
>> But I could add the failure node to the cluster without the
>> exception after 2) and 3).
>> I want to know why the rejoin succeeds. Should the exception happen?
>>
>> Regards,
>>
>>
>> ...
>
> [Message clipped]

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
Definitely sounds to me like something is not working as expected but I
don’t really have any idea what would cause that (other than the fairly
extreme failure scenario). A couple of things I can think of to try to
narrow it down:
1) Run nodetool flush on all nodes after step 2 - that will make sure all
data is written to sstables rather than relying on commit logs
2) Run the test with consistency level quorum rather than serial (shouldn’t
be any different but quorum is more widely used so maybe there is a bug
that’s specific to serial)
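[Editor's sketch: the two checks above, as a dry run. The node names and the keyspace/table are assumptions carried over from later in the thread; the commands are echoed rather than executed.]

```shell
# Dry-run sketch of the two checks suggested above. Node names and the
# keyspace/table are assumptions; replace `echo` with the real commands.
NODES="node1 node2 node3"
for h in $NODES; do
  # 1) flush memtables to sstables on every node after step 2
  echo "ssh $h nodetool flush"
done
# 2) repeat the row count at QUORUM instead of SERIAL
echo 'cqlsh -e "CONSISTENCY QUORUM; SELECT count(*) FROM testkeyspace.testtable;"'
```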

Cheers
Ben

On Mon, 24 Oct 2016 at 10:29 Yuji Ito <yu...@imagine-orb.com> wrote:

> Hi Ben,
>
> The test without killing nodes has been working well without data loss.
> I've repeated my test about 200 times after removing data and
> rebuild/repair.
>
> Regards,
>
>
> On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yu...@imagine-orb.com> wrote:
>
> > Just to confirm, are you saying:
> > a) after operation 2, you select all and get 1000 rows
> > b) after operation 3 (which only does updates and read) you select and
> only get 953 rows?
>
> That's right!
>
> I've started the test without killing nodes.
> I'll report the result to you next Monday.
>
> Thanks
>
>
> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> Just to confirm, are you saying:
> a) after operation 2, you select all and get 1000 rows
> b) after operation 3 (which only does updates and read) you select and
> only get 953 rows?
>
> If so, that would be very unexpected. If you run your tests without
> killing nodes do you get the expected (1,000) rows?
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> > Are you certain your tests don’t generate any overlapping inserts (by
> PK)?
>
> Yes. The operation 2) also checks the number of rows just after all
> insertions.
>
>
> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK. Are you certain your tests don’t generate any overlapping inserts (by
> PK)? Cassandra basically treats any inserts with the same primary key as
> updates (so 1000 insert operations may not necessarily result in 1000 rows
> in the DB).
>
> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> thanks Ben,
>
> > 1) At what stage did you have (or expect to have) 1000 rows (and have
> the mismatch between actual and expected) - at that end of operation (2) or
> after operation (3)?
>
> after operation 3), at operation 4) which reads all rows by cqlsh with
> CL.SERIAL
>
> > 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
> - create keyspace testkeyspace WITH REPLICATION =
> {'class':'SimpleStrategy','replication_factor':3};
> - consistency level is SERIAL
>
>
> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
>
> A couple of questions:
> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at that end of operation (2) or
> after operation (3)?
> 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> I tried to run a rebuild and repair after the failure node rejoined the
> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
> The failure node could rejoin and I could read all rows successfully.
> (Sometimes a repair failed because the node could not access another node. If
> it failed, I retried the repair.)
>
> But some rows were lost after my destructive test was repeated (after about
> 5-6 hours).
> After the test inserted 1000 rows, there were only 953 rows at the end of
> the test.
>
> My destructive test:
> - each C* node is killed & restarted at random intervals (within about
> 5 min) throughout this test
> 1) truncate all tables
> 2) insert initial rows (check if all rows are inserted successfully)
> 3) request a lot of read/write to random rows for about 30min
> 4) check all rows
> If operation 1), 2) or 4) fails due to C* failure, the test retries the
> operation.
>
> Does anyone have a similar problem?
> What causes the data loss?
> Does the test need any operation when a C* node is restarted? (Currently, I
> just restarted the C* process.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, that’s a bit more unexpected (to me at least) but I think the solution
> of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.
>
> Only node1 is the seed node.
> Node2, whose C* data was deleted, is NOT a seed.
>
> I restarted the failure node(node2) after restarting the seed node(node1).
> The restarting node2 succeeded without the exception.
> (I couldn't restart node2 before restarting node1 as expected.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
> The unstated "problem" here is that node1 is a seed, which implies
> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
> setup to start without bootstrapping).
>
> That means once the data dir is wiped, it's going to start again without a
> bootstrap, and make a single node cluster or join an existing cluster if
> the seed list is valid
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, sorry - I think I understand what you are asking now.
>
> However, I’m still a little confused by your description. I think your
> scenario is:
> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
> 2) Delete all data from Node A
> 3) Restart Node A
> 4) Restart Node B,C
>
> Is this correct?
>
> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node A
> starts successfully as there are no running nodes to tell it via gossip that
> it shouldn’t start up without the “replaces” flag.
>
> I think the right way to recover in this scenario is to run a nodetool
> rebuild on Node A after the other two nodes are running. You could
> theoretically also run a repair (which would be good practice after a weird
> failure scenario like this) but rebuild will probably be quicker given you
> know all the data needs to be re-streamed.
>
> Cheers
> Ben
>
> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thank you Ben, Yabin
>
> I understood the rejoin was illegal.
> I expected this rejoin would fail with the exception.
> But I could add the failure node to the cluster without the
> exception after 2) and 3).
> I want to know why the rejoin succeeds. Should the exception happen?
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>
> The exception you run into is expected behavior. This is because as Ben
> pointed out, when you delete everything (including system schemas), C*
> cluster thinks you're bootstrapping a new node. However,  node2's IP is
> still in gossip and this is why you see the exception.
>
> I'm not clear on the reason why you need to delete the C* data directory. That
> is a dangerous action, especially considering that you delete system
> schemas. If the failure node is gone for a while, what you need
> to do is remove the node first before doing a "rejoin".
>
> Cheers,
>
> Yabin
>
> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> To cassandra, the node where you deleted the files looks like a brand new
> machine. It doesn’t automatically rebuild machines to prevent accidental
> replacement. You need to tell it to build the “new” machines as a
> replacement for the “old” machine with that IP by setting
>
>

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
Hi Ben,

The test without killing nodes has been working well without data loss.
I've repeated my test about 200 times after removing data and
rebuild/repair.
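[Editor's sketch: the remove-data-and-rebuild recovery referred to above, as a dry run. The IP address is a placeholder and the commands are echoed rather than executed.]

```shell
# Dry-run sketch of rejoining a wiped node as a replacement for itself and
# re-streaming its data, as described in the thread. The IP is a placeholder.
DEAD_IP="10.0.0.2"   # IP of the node whose /var/lib/cassandra was wiped
echo "on the wiped node, add to cassandra-env.sh:"
echo "  JVM_OPTS=\"\$JVM_OPTS -Dcassandra.replace_address_first_boot=$DEAD_IP\""
echo "then start C* and re-stream the data once it has joined:"
echo "  nodetool rebuild    # or: nodetool repair -full"
```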

Regards,


On Fri, Oct 21, 2016 at 3:14 PM, Yuji Ito <yu...@imagine-orb.com> wrote:

> > Just to confirm, are you saying:
> > a) after operation 2, you select all and get 1000 rows
> > b) after operation 3 (which only does updates and read) you select and
> only get 953 rows?
>
> That's right!
>
> I've started the test without killing nodes.
> I'll report the result to you next Monday.
>
> Thanks
>
>
> On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
>> Just to confirm, are you saying:
>> a) after operation 2, you select all and get 1000 rows
>> b) after operation 3 (which only does updates and read) you select and
>> only get 953 rows?
>>
>> If so, that would be very unexpected. If you run your tests without
>> killing nodes do you get the expected (1,000) rows?
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>>> > Are you certain your tests don’t generate any overlapping inserts (by
>>> PK)?
>>>
>>> Yes. The operation 2) also checks the number of rows just after all
>>> insertions.
>>>
>>>
>>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
>>> wrote:
>>>
>>> OK. Are you certain your tests don’t generate any overlapping inserts
>>> (by PK)? Cassandra basically treats any inserts with the same primary key
>>> as updates (so 1000 insert operations may not necessarily result in 1000
>>> rows in the DB).
>>>
>>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> thanks Ben,
>>>
>>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>>> the mismatch between actual and expected) - at that end of operation (2) or
>>> after operation (3)?
>>>
>>> after operation 3), at operation 4) which reads all rows by cqlsh with
>>> CL.SERIAL
>>>
>>> > 2) What replication factor and replication strategy is used by the
>>> test keyspace? What consistency level is used by your operations?
>>>
>>> - create keyspace testkeyspace WITH REPLICATION =
>>> {'class':'SimpleStrategy','replication_factor':3};
>>> - consistency level is SERIAL
>>>
>>>
>>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <ben.slater@instaclustr.com
>>> > wrote:
>>>
>>>
>>> A couple of questions:
>>> 1) At what stage did you have (or expect to have) 1000 rows (and have
>>> the mismatch between actual and expected) - at that end of operation (2) or
>>> after operation (3)?
>>> 2) What replication factor and replication strategy is used by the test
>>> keyspace? What consistency level is used by your operations?
>>>
>>>
>>> Cheers
>>> Ben
>>>
>>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> Thanks Ben,
>>>
>>> I tried to run a rebuild and repair after the failure node rejoined the
>>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>>> The failure node could rejoin and I could read all rows successfully.
>>> (Sometimes a repair failed because the node could not access another node. If
>>> it failed, I retried the repair.)
>>>
>>> But some rows were lost after my destructive test was repeated (after about
>>> 5-6 hours).
>>> After the test inserted 1000 rows, there were only 953 rows at the end
>>> of the test.
>>>
>>> My destructive test:
>>> - each C* node is killed & restarted at random intervals (within
>>> about 5 min) throughout this test
>>> 1) truncate all tables
>>> 2) insert initial rows (check if all rows are inserted successfully)
>>> 3) request a lot of read/write to random rows for about 30min
>>> 4) check all rows
>>> If operation 1), 2) or 4) fails due to C* failure, the test retries the
>>> operation.
>>>
>>> Does anyone have a similar problem?
>>> What causes the data loss?
>>> Does the test need any operation when a C* node is restarted? (Currently,
>>> I just restarted the C* process.)
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
>>> wrote:
>>>
>>> OK, that’s a bit more unexpected (to me at least) but I think the
>>> solution of running a rebuild or repair still applies.
>>>
>>> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> Thanks Ben, Jeff
>>>
>>> Sorry that my explanation confused you.
>>>
>>> Only node1 is the seed node.
>>> Node2, whose C* data was deleted, is NOT a seed.
>>>
>>> I restarted the failure node(node2) after restarting the seed
>>> node(node1).
>>> The restarting node2 succeeded without the exception.
>>> (I couldn't restart node2 before restarting node1 as expected.)
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
>>> wrote:
>>>
>>> The unstated "problem" here is that node1 is a seed, which implies
>>> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
>>> setup to start without bootstrapping).
>>>
>>> That means once the data dir is wiped, it's going to start again without
>>> a bootstrap, and make a single node cluster or join an existing cluster if
>>> the seed list is valid
>>>
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
>>> wrote:
>>>
>>> OK, sorry - I think I understand what you are asking now.
>>>
>>> However, I’m still a little confused by your description. I think your
>>> scenario is:
>>> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
>>> 2) Delete all data from Node A
>>> 3) Restart Node A
>>> 4) Restart Node B,C
>>>
>>> Is this correct?
>>>
>>> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node
>>> A starts successfully as there are no running nodes to tell it via gossip
>>> that it shouldn’t start up without the “replaces” flag.
>>>
>>> I think the right way to recover in this scenario is to run a nodetool
>>> rebuild on Node A after the other two nodes are running. You could
>>> theoretically also run a repair (which would be good practice after a weird
>>> failure scenario like this) but rebuild will probably be quicker given you
>>> know all the data needs to be re-streamed.
>>>
>>> Cheers
>>> Ben
>>>
>>> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> Thank you Ben, Yabin
>>>
>>> I understood the rejoin was illegal.
>>> I expected this rejoin would fail with the exception.
>>> But I could add the failure node to the cluster without the
>>> exception after 2) and 3).
>>> I want to know why the rejoin succeeds. Should the exception happen?
>>>
>>> Regards,
>>>
>>>
>>> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>>>
>>> The exception you run into is expected behavior. This is because as Ben
>>> pointed out, when you delete everything (including system schemas), C*
>>> cluster thinks you're bootstrapping a new node. However,  node2's IP is
>>> still in gossip and this is why you see the exception.
>>>
>>> I'm not clear on the reason why you need to delete the C* data directory.
>>> That is a dangerous action, especially considering that you delete system
>>> schemas. If the failure node is gone for a while, what you need
>>> to do is remove the node first before doing a "rejoin".
>>>
>>> Cheers,
>>>
>>> Yabin
>>>
>>> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
>>> wrote:
>>>
>>> To cassandra, the node where you deleted the files looks like a brand
>>> new machine. It doesn’t automatically rebuild machines to prevent
>>> accidental replacement. You need to tell it to build the “new” machines as
>>> a replacement for the “old” machine with that IP by setting
>>> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
>>> http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>>>
>>> Cheers
>>> Ben
>>>
>>> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>>>
>>> Hi all,
>>>
>>> A failure node can rejoin a cluster.
>>> On the node, all data in /var/lib/cassandra were deleted.
>>> Is it normal?
>>>
>>> I can reproduce it as below.
>>>
>>> cluster:
>>> - C* 2.2.7
>>> - a cluster has node1, 2, 3
>>> - node1 is a seed
>>> - replication_factor: 3
>>>
>>> how to:
>>> 1) stop C* process and delete all data in /var/lib/cassandra on node2
>>> ($sudo rm -rf /var/lib/cassandra/*)
>>> 2) stop C* process on node1 and node3
>>> 3) restart C* on node1
>>> 4) restart C* on node2
>>>
>>> nodetool status after 4):
>>> Datacenter: datacenter1
>>> =======================
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address        Load       Tokens       Owns (effective)  Host ID
>>>                           Rack
>>> DN  [node3 IP]  ?                 256          100.0%
>>>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
>>> UN  [node2 IP]  7.76 MB      256          100.0%
>>>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>>> UN  [node1 IP]  416.13 MB  256          100.0%
>>>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>>
>>> If I restart C* on node 2 when C* on node1 and node3 are running
>>> (without 2), 3)), a runtime exception happens.
>>> RuntimeException: "A node with address [node2 IP] already exists,
>>> cancelling join..."
>>>
>>> I'm not sure this causes data loss. All data can be read properly just
>>> after this rejoin.
>>> But some rows are lost when I kill & restart C* for destructive tests
>>> after this rejoin.
>>>
>>> Thanks.
>>>
>>> --
>>> ————————
>>> Ben Slater
>>> Chief Product Officer
>>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>>> +61 437 929 798
>>>
>>>
>>>
>>> --
>>> ————————
>>> Ben Slater
>>> Chief Product Officer
>>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>>> +61 437 929 798
>>>
>>> ____________________________________________________________________
>>> CONFIDENTIALITY NOTE: This e-mail and any attachments are confidential
>>> and may be legally privileged. If you are not the intended recipient, do
>>> not disclose, copy, distribute, or use this email or any attachments. If
>>> you have received this in error please let the sender know and then delete
>>> the email and all attachments.
>>>
>>>
>>> --
>>> ————————
>>> Ben Slater
>>> Chief Product Officer
>>>
>>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>
>

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
> Just to confirm, are you saying:
> a) after operation 2, you select all and get 1000 rows
> b) after operation 3 (which only does updates and read) you select and
only get 953 rows?

That's right!

I've started the test without killing nodes.
I'll report the result to you next Monday.

Thanks


On Fri, Oct 21, 2016 at 3:05 PM, Ben Slater <be...@instaclustr.com>
wrote:

> Just to confirm, are you saying:
> a) after operation 2, you select all and get 1000 rows
> b) after operation 3 (which only does updates and read) you select and
> only get 953 rows?
>
> If so, that would be very unexpected. If you run your tests without
> killing nodes do you get the expected (1,000) rows?
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:
>
>> > Are you certain your tests don’t generate any overlapping inserts (by
>> PK)?
>>
>> Yes. The operation 2) also checks the number of rows just after all
>> insertions.
>>
>>
>> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK. Are you certain your tests don’t generate any overlapping inserts (by
>> PK)? Cassandra basically treats any inserts with the same primary key as
>> updates (so 1000 insert operations may not necessarily result in 1000 rows
>> in the DB).
>>
>> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> thanks Ben,
>>
>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>> the mismatch between actual and expected) - at that end of operation (2) or
>> after operation (3)?
>>
>> after operation 3), at operation 4) which reads all rows by cqlsh with
>> CL.SERIAL
>>
>> > 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>> - create keyspace testkeyspace WITH REPLICATION =
>> {'class':'SimpleStrategy','replication_factor':3};
>> - consistency level is SERIAL
>>
>>
>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>>
>> A couple of questions:
>> 1) At what stage did you have (or expect to have) 1000 rows (and have the
>> mismatch between actual and expected) - at that end of operation (2) or
>> after operation (3)?
>> 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben,
>>
>> I tried to run a rebuild and repair after the failure node rejoined the
>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>> The failure node could rejoin and I could read all rows successfully.
>> (Sometimes a repair failed because the node could not access another node. If
>> it failed, I retried the repair.)
>>
>> But some rows were lost after my destructive test was repeated (after about
>> 5-6 hours).
>> After the test inserted 1000 rows, there were only 953 rows at the end of
>> the test.
>>
>> My destructive test:
>> - each C* node is killed & restarted at random intervals (within about
>> 5 min) throughout this test
>> 1) truncate all tables
>> 2) insert initial rows (check if all rows are inserted successfully)
>> 3) request a lot of read/write to random rows for about 30min
>> 4) check all rows
>> If operation 1), 2) or 4) fails due to C* failure, the test retries the
>> operation.
>>
>> Does anyone have a similar problem?
>> What causes the data loss?
>> Does the test need any operation when a C* node is restarted? (Currently, I
>> just restarted the C* process.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK, that’s a bit more unexpected (to me at least) but I think the
>> solution of running a rebuild or repair still applies.
>>
>> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben, Jeff
>>
>> Sorry that my explanation confused you.
>>
>> Only node1 is the seed node.
>> Node2, whose C* data was deleted, is NOT a seed.
>>
>> I restarted the failure node(node2) after restarting the seed node(node1).
>> The restarting node2 succeeded without the exception.
>> (I couldn't restart node2 before restarting node1 as expected.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
>> wrote:
>>
>> The unstated "problem" here is that node1 is a seed, which implies
>> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
>> setup to start without bootstrapping).
>>
>> That means once the data dir is wiped, it's going to start again without
>> a bootstrap, and make a single node cluster or join an existing cluster if
>> the seed list is valid
>>
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK, sorry - I think I understand what you are asking now.
>>
>> However, I’m still a little confused by your description. I think your
>> scenario is:
>> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
>> 2) Delete all data from Node A
>> 3) Restart Node A
>> 4) Restart Node B,C
>>
>> Is this correct?
>>
>> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node
>> A starts successfully as there are no running nodes to tell it via gossip
>> that it shouldn’t start up without the “replaces” flag.
>>
>> I think the right way to recover in this scenario is to run a nodetool
>> rebuild on Node A after the other two nodes are running. You could
>> theoretically also run a repair (which would be good practice after a weird
>> failure scenario like this) but rebuild will probably be quicker given you
>> know all the data needs to be re-streamed.
>>
>> Cheers
>> Ben
>>
>> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thank you Ben, Yabin
>>
>> I understood the rejoin was illegal.
>> I expected this rejoin would fail with the exception.
>> But I could add the failure node to the cluster without the
>> exception after 2) and 3).
>> I want to know why the rejoin succeeds. Should the exception happen?
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>>
>> The exception you run into is expected behavior. This is because as Ben
>> pointed out, when you delete everything (including system schemas), C*
>> cluster thinks you're bootstrapping a new node. However,  node2's IP is
>> still in gossip and this is why you see the exception.
>>
>> I'm not clear on the reason why you need to delete the C* data directory.
>> That is a dangerous action, especially considering that you delete system
>> schemas. If the failure node is gone for a while, what you need
>> to do is remove the node first before doing a "rejoin".
>>
>> Cheers,
>>
>> Yabin
>>
>> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> To cassandra, the node where you deleted the files looks like a brand new
>> machine. It doesn’t automatically rebuild machines to prevent accidental
>> replacement. You need to tell it to build the “new” machines as a
>> replacement for the “old” machine with that IP by setting
>> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
>> http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>>
>> Cheers
>> Ben
>>
>> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Hi all,
>>
>> A failure node can rejoin a cluster.
>> On the node, all data in /var/lib/cassandra were deleted.
>> Is it normal?
>>
>> I can reproduce it as below.
>>
>> cluster:
>> - C* 2.2.7
>> - a cluster has node1, 2, 3
>> - node1 is a seed
>> - replication_factor: 3
>>
>> how to:
>> 1) stop C* process and delete all data in /var/lib/cassandra on node2
>> ($sudo rm -rf /var/lib/cassandra/*)
>> 2) stop C* process on node1 and node3
>> 3) restart C* on node1
>> 4) restart C* on node2
>>
>> nodetool status after 4):
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load       Tokens       Owns (effective)  Host ID
>>                           Rack
>> DN  [node3 IP]  ?                 256          100.0%
>>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
>> UN  [node2 IP]  7.76 MB      256          100.0%
>>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>> UN  [node1 IP]  416.13 MB  256          100.0%
>>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>
>> If I restart C* on node 2 when C* on node1 and node3 are running (without
>> 2), 3)), a runtime exception happens.
>> RuntimeException: "A node with address [node2 IP] already exists,
>> cancelling join..."
>>
>> I'm not sure this causes data loss. All data can be read properly just
>> after this rejoin.
>> But some rows are lost when I kill & restart C* for destructive tests after
>> this rejoin.
>>
>> Thanks.
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>>
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>>
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>>
>> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
Just to confirm, are you saying:
a) after operation 2, you select all and get 1000 rows
b) after operation 3 (which only does updates and read) you select and only
get 953 rows?

If so, that would be very unexpected. If you run your tests without killing
nodes do you get the expected (1,000) rows?
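[Editor's sketch: one way to make the stage-by-stage row check reproducible, as a dry run. The keyspace/table names are assumptions carried over from the thread.]

```shell
# Dry-run sketch: count rows at SERIAL after the insert stage and again after
# the read/write stage, so a mismatch can be tied to a stage. Names assumed.
EXPECTED=1000
for stage in after-insert after-read-write; do
  echo "[$stage] cqlsh -e \"CONSISTENCY SERIAL; SELECT count(*) FROM testkeyspace.testtable;\""
done
echo "expected rows: $EXPECTED"
```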

Cheers
Ben

On Fri, 21 Oct 2016 at 17:00 Yuji Ito <yu...@imagine-orb.com> wrote:

> > Are you certain your tests don’t generate any overlapping inserts (by
> PK)?
>
> Yes. The operation 2) also checks the number of rows just after all
> insertions.
>
>
> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK. Are you certain your tests don’t generate any overlapping inserts (by
> PK)? Cassandra basically treats any inserts with the same primary key as
> updates (so 1000 insert operations may not necessarily result in 1000 rows
> in the DB).
>
> On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> thanks Ben,
>
> > 1) At what stage did you have (or expect to have) 1000 rows (and have
> the mismatch between actual and expected) - at that end of operation (2) or
> after operation (3)?
>
> after operation 3), at operation 4) which reads all rows by cqlsh with
> CL.SERIAL
>
> > 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
> - create keyspace testkeyspace WITH REPLICATION =
> {'class':'SimpleStrategy','replication_factor':3};
> - consistency level is SERIAL
>
>
> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
>
> A couple of questions:
> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at that end of operation (2) or
> after operation (3)?
> 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> I tried to run a rebuild and repair after the failure node rejoined the
> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
> The failure node could rejoin and I could read all rows successfully.
> (Sometimes a repair failed because the node could not access another node. If
> it failed, I retried the repair.)
>
> But some rows were lost after my destructive test was repeated (after about
> 5-6 hours).
> After the test inserted 1000 rows, there were only 953 rows at the end of
> the test.
>
> My destructive test:
> - each C* node is killed & restarted at random intervals (within about
> 5 min) throughout this test
> 1) truncate all tables
> 2) insert initial rows (check if all rows are inserted successfully)
> 3) request a lot of read/write to random rows for about 30min
> 4) check all rows
> If operation 1), 2) or 4) fails due to C* failure, the test retries the
> operation.
>
> Does anyone have a similar problem?
> What causes the data loss?
> Does the test need any operation when a C* node is restarted? (Currently, I
> just restarted the C* process.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, that’s a bit more unexpected (to me at least) but I think the solution
> of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.
>
> Only node1 is the seed node.
> Node2 whose C* data is deleted is NOT a seed.
>
> I restarted the failure node(node2) after restarting the seed node(node1).
> The restarting node2 succeeded without the exception.
> (I couldn't restart node2 before restarting node1 as expected.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
> The unstated "problem" here is that node1 is a seed, which implies
> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
> setup to start without bootstrapping).
>
> That means once the data dir is wiped, it's going to start again without a
> bootstrap, and make a single node cluster or join an existing cluster if
> the seed list is valid
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, sorry - I think understand what you are asking now.
>
> However, I’m still a little confused by your description. I think your
> scenario is:
> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
> 2) Delete all data from Node A
> 3) Restart Node A
> 4) Restart Node B,C
>
> Is this correct?
>
> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node A
> starts succesfully as there are no running nodes to tell it via gossip that
> it shouldn’t start up without the “replaces” flag.
>
> I think that right way to recover in this scenario is to run a nodetool
> rebuild on Node A after the other two nodes are running. You could
> theoretically also run a repair (which would be good practice after a weird
> failure scenario like this) but rebuild will probably be quicker given you
> know all the data needs to be re-streamed.
>
> Cheers
> Ben
>
> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thank you Ben, Yabin
>
> I understood the rejoin was illegal.
> I expected this rejoin would fail with the exception.
> But I could add the failure node to the cluster without the
> exception after 2) and 3).
> I want to know why the rejoin succeeds. Should the exception happen?
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>
> The exception you run into is expected behavior. This is because as Ben
> pointed out, when you delete everything (including system schemas), C*
> cluster thinks you're bootstrapping a new node. However,  node2's IP is
> still in gossip and this is why you see the exception.
>
> I'm not clear the reasoning why you need to delete C* data directory. That
> is a dangerous action, especially considering that you delete system
> schemas. If in any case the failure node is gone for a while, what you need
> to do is to is remove the node first before doing "rejoin".
>
> Cheers,
>
> Yabin
>
> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> To cassandra, the node where you deleted the files looks like a brand new
> machine. It doesn’t automatically rebuild machines to prevent accidental
> replacement. You need to tell it to build the “new” machines as a
> replacement for the “old” machine with that IP by setting -Dcassandra.replace_address_first_boot=<dead_node_ip>.
> See http://cassandra.apache.org/doc/latest/operating/topo_changes.html
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__cassandra.apache.org_doc_latest_operating_topo-5Fchanges.html&d=DQMFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kqhAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=KGo0EnUT-Bop-0OnyQJRuFvNOf99S9tWEgziATmNfJ8&s=YazqmnV8TuuQXt9PDn0kFe6C08b7tQQXrqouXBCVVXE&e=>
> .
>
> Cheers
> Ben
>
> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Hi all,
>
> A failure node can rejoin a cluster.
> On the node, all data in /var/lib/cassandra were deleted.
> Is it normal?
>
> I can reproduce it as below.
>
> cluster:
> - C* 2.2.7
> - a cluster has node1, 2, 3
> - node1 is a seed
> - replication_factor: 3
>
> how to:
> 1) stop C* process and delete all data in /var/lib/cassandra on node2
> ($sudo rm -rf /var/lib/cassandra/*)
> 2) stop C* process on node1 and node3
> 3) restart C* on node1
> 4) restart C* on node2
>
> nodetool status after 4):
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens       Owns (effective)  Host ID
>                         Rack
> DN  [node3 IP]  ?                 256          100.0%
>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
> UN  [node2 IP]  7.76 MB      256          100.0%
>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
> UN  [node1 IP]  416.13 MB  256          100.0%
>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>
> If I restart C* on node 2 when C* on node1 and node3 are running (without
> 2), 3)), a runtime exception happens.
> RuntimeException: "A node with address [node2 IP] already exists,
> cancelling join..."
>
> I'm not sure this causes data lost. All data can be read properly just
> after this rejoin.
> But some rows are lost when I kill&restart C* for destructive tests after
> this rejoin.
>
> Thanks.
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
> ____________________________________________________________________
> CONFIDENTIALITY NOTE: This e-mail and any attachments are confidential and
> may be legally privileged. If you are not the intended recipient, do not
> disclose, copy, distribute, or use this email or any attachments. If you
> have received this in error please let the sender know and then delete the
> email and all attachments.
>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
>
> --
————————
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
> Are you certain your tests don’t generate any overlapping inserts (by PK)?

Yes. Operation 2) also checks the number of rows immediately after all the
insertions.
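(As an aside, the check-and-retry behaviour the test relies on can be sketched
roughly as below; the names and the simulated failure are purely illustrative,
not taken from the actual harness:)

```python
# Illustrative sketch of the test's "retry the operation on C* failure" logic.
# RuntimeError stands in for a driver timeout/unavailable error.
def retry(op, attempts=5):
    last = None
    for _ in range(attempts):
        try:
            return op()
        except RuntimeError as e:
            last = e            # node may be mid-restart; try again
    raise last

# Example: a check that fails twice while a node restarts, then succeeds.
state = {"calls": 0}
def flaky_count_rows():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("node down")
    return 1000

print(retry(flaky_count_rows))  # 1000, after two retried failures
```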


On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater <be...@instaclustr.com>
wrote:

> OK. Are you certain your tests don’t generate any overlapping inserts (by
> PK)? Cassandra basically treats any inserts with the same primary key as
> updates (so 1000 insert operations may not necessarily result in 1000 rows
> in the DB).

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
OK. Are you certain your tests don’t generate any overlapping inserts (by
PK)? Cassandra basically treats any inserts with the same primary key as
updates (so 1000 insert operations may not necessarily result in 1000 rows
in the DB).
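A toy model of that upsert behaviour (plain Python, no driver involved - a
dict keyed by primary key plays the role of the table):

```python
# In Cassandra, an INSERT with an existing primary key overwrites the row
# rather than adding a new one (upsert semantics). Model the table as a dict.
table = {}
for i in range(1000):
    pk = i % 950          # 50 of the 1000 inserts reuse an earlier key
    table[pk] = "value-%d" % i

print(len(table))         # 950 rows result from 1000 insert operations
```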

On Fri, 21 Oct 2016 at 16:30 Yuji Ito <yu...@imagine-orb.com> wrote:

> thanks Ben,
>
> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>> the mismatch between actual and expected) - at the end of operation (2) or
> after operation (3)?
>
> after operation 3), at operation 4) which reads all rows by cqlsh with
> CL.SERIAL
>
> > 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
> - create keyspace testkeyspace WITH REPLICATION =
> {'class':'SimpleStrategy','replication_factor':3};
> - consistency level is SERIAL
>
>
> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
>
> A couple of questions:
> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at that end of operation (2) or
> after operation (3)?
> 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben,
>
> I tried to run a rebuild and repair after the failure node rejoined the
> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
> The failure node could rejoined and I could read all rows successfully.
> (Sometimes a repair failed because the node cannot access other node. If
> it failed, I retried a repair)
>
> But some rows were lost after my destructive test repeated (after about
> 5-6 hours).
> After the test inserted 1000 rows, there were only 953 rows at the end of
> the test.
>
> My destructive test:
> - each C* node is killed & restarted at the random interval (within about
> 5 min) throughout this test
> 1) truncate all tables
> 2) insert initial rows (check if all rows are inserted successfully)
> 3) request a lot of read/write to random rows for about 30min
> 4) check all rows
> If operation 1), 2) or 4) fail due to C* failure, the test retry the
> operation.
>
> Does anyone have the similar problem?
> What causes data lost?
> Does the test need any operation when C* node is restarted? (Currently, I
> just restarted C* process)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, that’s a bit more unexpected (to me at least) but I think the solution
> of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.
>
> Only node1 is the seed node.
> Node2 whose C* data is deleted is NOT a seed.
>
> I restarted the failure node(node2) after restarting the seed node(node1).
> The restarting node2 succeeded without the exception.
> (I couldn't restart node2 before restarting node1 as expected.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
> The unstated "problem" here is that node1 is a seed, which implies
> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
> setup to start without bootstrapping).
>
> That means once the data dir is wiped, it's going to start again without a
> bootstrap, and make a single node cluster or join an existing cluster if
> the seed list is valid
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, sorry - I think understand what you are asking now.
>
> However, I’m still a little confused by your description. I think your
> scenario is:
> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
> 2) Delete all data from Node A
> 3) Restart Node A
> 4) Restart Node B,C
>
> Is this correct?
>
> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node A
> starts succesfully as there are no running nodes to tell it via gossip that
> it shouldn’t start up without the “replaces” flag.
>
> I think that right way to recover in this scenario is to run a nodetool
> rebuild on Node A after the other two nodes are running. You could
> theoretically also run a repair (which would be good practice after a weird
> failure scenario like this) but rebuild will probably be quicker given you
> know all the data needs to be re-streamed.
>
> Cheers
> Ben
>
> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thank you Ben, Yabin
>
> I understood the rejoin was illegal.
> I expected this rejoin would fail with the exception.
> But I could add the failure node to the cluster without the
> exception after 2) and 3).
> I want to know why the rejoin succeeds. Should the exception happen?
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>
> The exception you run into is expected behavior. This is because as Ben
> pointed out, when you delete everything (including system schemas), C*
> cluster thinks you're bootstrapping a new node. However,  node2's IP is
> still in gossip and this is why you see the exception.
>
> I'm not clear the reasoning why you need to delete C* data directory. That
> is a dangerous action, especially considering that you delete system
> schemas. If in any case the failure node is gone for a while, what you need
> to do is to is remove the node first before doing "rejoin".
>
> Cheers,
>
> Yabin
>
> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> To cassandra, the node where you deleted the files looks like a brand new
> machine. It doesn’t automatically rebuild machines to prevent accidental
> replacement. You need to tell it to build the “new” machines as a
> replacement for the “old” machine with that IP by setting -Dcassandra.replace_address_first_boot=<dead_node_ip>.
> See http://cassandra.apache.org/doc/latest/operating/topo_changes.html
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__cassandra.apache.org_doc_latest_operating_topo-5Fchanges.html&d=DQMFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kqhAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=KGo0EnUT-Bop-0OnyQJRuFvNOf99S9tWEgziATmNfJ8&s=YazqmnV8TuuQXt9PDn0kFe6C08b7tQQXrqouXBCVVXE&e=>
> .
>
> Cheers
> Ben
>
> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Hi all,
>
> A failure node can rejoin a cluster.
> On the node, all data in /var/lib/cassandra were deleted.
> Is it normal?
>
> I can reproduce it as below.
>
> cluster:
> - C* 2.2.7
> - a cluster has node1, 2, 3
> - node1 is a seed
> - replication_factor: 3
>
> how to:
> 1) stop C* process and delete all data in /var/lib/cassandra on node2
> ($sudo rm -rf /var/lib/cassandra/*)
> 2) stop C* process on node1 and node3
> 3) restart C* on node1
> 4) restart C* on node2
>
> nodetool status after 4):
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens       Owns (effective)  Host ID
>                         Rack
> DN  [node3 IP]  ?                 256          100.0%
>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
> UN  [node2 IP]  7.76 MB      256          100.0%
>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
> UN  [node1 IP]  416.13 MB  256          100.0%
>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>
> If I restart C* on node 2 when C* on node1 and node3 are running (without
> 2), 3)), a runtime exception happens.
> RuntimeException: "A node with address [node2 IP] already exists,
> cancelling join..."
>
> I'm not sure this causes data lost. All data can be read properly just
> after this rejoin.
> But some rows are lost when I kill&restart C* for destructive tests after
> this rejoin.
>
> Thanks.
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
> ____________________________________________________________________
> CONFIDENTIALITY NOTE: This e-mail and any attachments are confidential and
> may be legally privileged. If you are not the intended recipient, do not
> disclose, copy, distribute, or use this email or any attachments. If you
> have received this in error please let the sender know and then delete the
> email and all attachments.
>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
> --
————————
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
thanks Ben,

> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at the end of operation (2) or
> after operation (3)?

after operation 3), at operation 4), which reads all rows via cqlsh with
CL.SERIAL.

> 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?

- create keyspace testkeyspace WITH REPLICATION =
{'class':'SimpleStrategy','replication_factor':3};
- consistency level is SERIAL
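(For what it's worth, with replication_factor 3 a SERIAL operation runs a
Paxos round that needs a quorum of replicas, so it can proceed with one
replica down but not two. A back-of-envelope sketch, not driver code:)

```python
def quorum(rf):
    # Cassandra quorum: floor(rf / 2) + 1 replicas must participate.
    return rf // 2 + 1

for rf in (1, 3, 5):
    print("RF=%d quorum=%d tolerates %d replica(s) down"
          % (rf, quorum(rf), rf - quorum(rf)))
# RF=3 -> quorum=2, so SERIAL ops survive one node down, but not two.
```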


On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater <be...@instaclustr.com>
wrote:

>
> A couple of questions:
> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at the end of operation (2) or
> after operation (3)?
> 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:
>
>> Thanks Ben,
>>
>> I tried to run a rebuild and repair after the failure node rejoined the
>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>> The failure node could rejoined and I could read all rows successfully.
>> (Sometimes a repair failed because the node cannot access other node. If
>> it failed, I retried a repair)
>>
>> But some rows were lost after my destructive test repeated (after about
>> 5-6 hours).
>> After the test inserted 1000 rows, there were only 953 rows at the end of
>> the test.
>>
>> My destructive test:
>> - each C* node is killed & restarted at the random interval (within about
>> 5 min) throughout this test
>> 1) truncate all tables
>> 2) insert initial rows (check if all rows are inserted successfully)
>> 3) request a lot of read/write to random rows for about 30min
>> 4) check all rows
>> If operation 1), 2) or 4) fail due to C* failure, the test retry the
>> operation.
>>
>> Does anyone have the similar problem?
>> What causes data lost?
>> Does the test need any operation when C* node is restarted? (Currently, I
>> just restarted C* process)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK, that’s a bit more unexpected (to me at least) but I think the
>> solution of running a rebuild or repair still applies.
>>
>> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thanks Ben, Jeff
>>
>> Sorry that my explanation confused you.
>>
>> Only node1 is the seed node.
>> Node2 whose C* data is deleted is NOT a seed.
>>
>> I restarted the failure node(node2) after restarting the seed node(node1).
>> The restarting node2 succeeded without the exception.
>> (I couldn't restart node2 before restarting node1 as expected.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
>> wrote:
>>
>> The unstated "problem" here is that node1 is a seed, which implies
>> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
>> setup to start without bootstrapping).
>>
>> That means once the data dir is wiped, it's going to start again without
>> a bootstrap, and make a single node cluster or join an existing cluster if
>> the seed list is valid
>>
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK, sorry - I think understand what you are asking now.
>>
>> However, I’m still a little confused by your description. I think your
>> scenario is:
>> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
>> 2) Delete all data from Node A
>> 3) Restart Node A
>> 4) Restart Node B,C
>>
>> Is this correct?
>>
>> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node
>> A starts succesfully as there are no running nodes to tell it via gossip
>> that it shouldn’t start up without the “replaces” flag.
>>
>> I think that right way to recover in this scenario is to run a nodetool
>> rebuild on Node A after the other two nodes are running. You could
>> theoretically also run a repair (which would be good practice after a weird
>> failure scenario like this) but rebuild will probably be quicker given you
>> know all the data needs to be re-streamed.
>>
>> Cheers
>> Ben
>>
>> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thank you Ben, Yabin
>>
>> I understood the rejoin was illegal.
>> I expected this rejoin would fail with the exception.
>> But I could add the failure node to the cluster without the
>> exception after 2) and 3).
>> I want to know why the rejoin succeeds. Should the exception happen?
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>>
>> The exception you run into is expected behavior. This is because as Ben
>> pointed out, when you delete everything (including system schemas), C*
>> cluster thinks you're bootstrapping a new node. However,  node2's IP is
>> still in gossip and this is why you see the exception.
>>
>> I'm not clear the reasoning why you need to delete C* data directory.
>> That is a dangerous action, especially considering that you delete system
>> schemas. If in any case the failure node is gone for a while, what you need
>> to do is to is remove the node first before doing "rejoin".
>>
>> Cheers,
>>
>> Yabin
>>
>> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> To cassandra, the node where you deleted the files looks like a brand new
>> machine. It doesn’t automatically rebuild machines to prevent accidental
>> replacement. You need to tell it to build the “new” machines as a
>> replacement for the “old” machine with that IP by setting
>> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
>> http://cassandra.apache.org/doc/latest/operating/topo_changes.html
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__cassandra.apache.org_doc_latest_operating_topo-5Fchanges.html&d=DQMFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kqhAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=KGo0EnUT-Bop-0OnyQJRuFvNOf99S9tWEgziATmNfJ8&s=YazqmnV8TuuQXt9PDn0kFe6C08b7tQQXrqouXBCVVXE&e=>
>> .
>>
>> Cheers
>> Ben
>>
>> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Hi all,
>>
>> A failure node can rejoin a cluster.
>> On the node, all data in /var/lib/cassandra were deleted.
>> Is it normal?
>>
>> I can reproduce it as below.
>>
>> cluster:
>> - C* 2.2.7
>> - a cluster has node1, 2, 3
>> - node1 is a seed
>> - replication_factor: 3
>>
>> how to:
>> 1) stop C* process and delete all data in /var/lib/cassandra on node2
>> ($sudo rm -rf /var/lib/cassandra/*)
>> 2) stop C* process on node1 and node3
>> 3) restart C* on node1
>> 4) restart C* on node2
>>
>> nodetool status after 4):
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load       Tokens       Owns (effective)  Host ID
>>                           Rack
>> DN  [node3 IP]  ?                 256          100.0%
>>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
>> UN  [node2 IP]  7.76 MB      256          100.0%
>>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>> UN  [node1 IP]  416.13 MB  256          100.0%
>>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>
>> If I restart C* on node 2 when C* on node1 and node3 are running (without
>> 2), 3)), a runtime exception happens.
>> RuntimeException: "A node with address [node2 IP] already exists,
>> cancelling join..."
>>
>> I'm not sure whether this causes data loss. All data can be read properly
>> just after this rejoin.
>> But some rows are lost when I kill & restart C* for destructive tests
>> after this rejoin.
>>
>> Thanks.
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>>
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>> ____________________________________________________________________
>> CONFIDENTIALITY NOTE: This e-mail and any attachments are confidential
>> and may be legally privileged. If you are not the intended recipient, do
>> not disclose, copy, distribute, or use this email or any attachments. If
>> you have received this in error please let the sender know and then delete
>> the email and all attachments.
>>
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>>
>> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
A couple of questions:
1) At what stage did you have (or expect to have) 1000 rows (and have the
mismatch between actual and expected) - at the end of operation (2) or
after operation (3)?
2) What replication factor and replication strategy is used by the test
keyspace? What consistency level is used by your operations?


Cheers
Ben

On Fri, 21 Oct 2016 at 13:57 Yuji Ito <yu...@imagine-orb.com> wrote:

> Thanks Ben,
>
> I tried to run a rebuild and repair after the failure node rejoined the
> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
> The failure node could rejoin and I could read all rows successfully.
> (Sometimes a repair failed because the node couldn't access another node.
> If it failed, I retried the repair.)
>
> But some rows were lost after my destructive test had run repeatedly (for
> about 5-6 hours).
> After the test inserted 1000 rows, there were only 953 rows at the end of
> the test.
>
> My destructive test:
> - each C* node is killed & restarted at the random interval (within about
> 5 min) throughout this test
> 1) truncate all tables
> 2) insert initial rows (check if all rows are inserted successfully)
> 3) request a lot of read/write to random rows for about 30min
> 4) check all rows
> If operation 1), 2) or 4) fails due to a C* failure, the test retries the
> operation.
>
> Does anyone have a similar problem?
> What causes the data loss?
> Does the test need any operation when a C* node is restarted? (Currently, I
> just restart the C* process.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, that’s a bit more unexpected (to me at least) but I think the solution
> of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.
>
> Only node1 is the seed node.
> Node2 whose C* data is deleted is NOT a seed.
>
> I restarted the failure node(node2) after restarting the seed node(node1).
> The restarting node2 succeeded without the exception.
> (I couldn't restart node2 before restarting node1 as expected.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
> The unstated "problem" here is that node1 is a seed, which implies
> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
> setup to start without bootstrapping).
>
> That means once the data dir is wiped, it's going to start again without a
> bootstrap, and make a single node cluster or join an existing cluster if
> the seed list is valid
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, sorry - I think I understand what you are asking now.
>
> However, I’m still a little confused by your description. I think your
> scenario is:
> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
> 2) Delete all data from Node A
> 3) Restart Node A
> 4) Restart Node B,C
>
> Is this correct?
>
> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node A
> starts successfully as there are no running nodes to tell it via gossip that
> it shouldn’t start up without the “replaces” flag.
>
> I think the right way to recover in this scenario is to run a nodetool
> rebuild on Node A after the other two nodes are running. You could
> theoretically also run a repair (which would be good practice after a weird
> failure scenario like this) but rebuild will probably be quicker given you
> know all the data needs to be re-streamed.
>
> Cheers
> Ben
>
> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thank you Ben, Yabin
>
> I understood the rejoin was illegal.
> I expected this rejoin would fail with the exception.
> But I could add the failure node to the cluster without the
> exception after 2) and 3).
> I want to know why the rejoin succeeds. Should the exception happen?
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>
> The exception you run into is expected behavior. This is because as Ben
> pointed out, when you delete everything (including system schemas), C*
> cluster thinks you're bootstrapping a new node. However,  node2's IP is
> still in gossip and this is why you see the exception.
>
> I'm not clear on the reasoning why you need to delete the C* data
> directory. That is a dangerous action, especially considering that you
> delete system schemas. If in any case the failure node is gone for a while,
> what you need to do is to remove the node first before doing "rejoin".
>
> Cheers,
>
> Yabin
>
> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> To cassandra, the node where you deleted the files looks like a brand new
> machine. It doesn’t automatically rebuild machines to prevent accidental
> replacement. You need to tell it to build the “new” machines as a
> replacement for the “old” machine with that IP by setting -Dcassandra.replace_address_first_boot=<dead_node_ip>.
> See http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>
> Cheers
> Ben
>
> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Hi all,
>
> A failure node can rejoin a cluster.
> On the node, all data in /var/lib/cassandra were deleted.
> Is it normal?
>
> I can reproduce it as below.
>
> cluster:
> - C* 2.2.7
> - a cluster has node1, 2, 3
> - node1 is a seed
> - replication_factor: 3
>
> how to:
> 1) stop C* process and delete all data in /var/lib/cassandra on node2
> ($sudo rm -rf /var/lib/cassandra/*)
> 2) stop C* process on node1 and node3
> 3) restart C* on node1
> 4) restart C* on node2
>
> nodetool status after 4):
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens       Owns (effective)  Host ID
>                         Rack
> DN  [node3 IP]  ?                 256          100.0%
>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
> UN  [node2 IP]  7.76 MB      256          100.0%
>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
> UN  [node1 IP]  416.13 MB  256          100.0%
>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>
> If I restart C* on node 2 when C* on node1 and node3 are running (without
> 2), 3)), a runtime exception happens.
> RuntimeException: "A node with address [node2 IP] already exists,
> cancelling join..."
>
> I'm not sure whether this causes data loss. All data can be read properly
> just after this rejoin.
> But some rows are lost when I kill & restart C* for destructive tests
> after this rejoin.
>
> Thanks.
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
> --
————————
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
Thanks Ben,

I tried to run a rebuild and repair after the failure node rejoined the
cluster as a "new" node with -Dcassandra.replace_address_first_boot.
The failure node could rejoin and I could read all rows successfully.
(Sometimes a repair failed because the node couldn't access another node.
If it failed, I retried the repair.)

But some rows were lost after my destructive test had run repeatedly (for
about 5-6 hours).
After the test inserted 1000 rows, there were only 953 rows at the end of
the test.

My destructive test:
- each C* node is killed & restarted at the random interval (within about 5
min) throughout this test
1) truncate all tables
2) insert initial rows (check if all rows are inserted successfully)
3) request a lot of read/write to random rows for about 30min
4) check all rows
If operation 1), 2) or 4) fails due to a C* failure, the test retries the
operation.

Does anyone have a similar problem?
What causes the data loss?
Does the test need any operation when a C* node is restarted? (Currently, I
just restart the C* process.)
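
For reference, one way to script the "check all rows" step is to gate it on
`nodetool status` reporting no down nodes first. This is only a hypothetical
helper sketch: the sample status text stands in for a live `nodetool status`
call, and the IPs are placeholders.

```shell
# In a real test this would be: status_output=$(nodetool status).
# Here a captured sample (placeholder IPs) stands in for that call.
status_output='DN  10.0.0.3  ?          256  100.0%  325553c6-3e05-41f6-a1f7-47436743816f  rack1
UN  10.0.0.2  7.76 MB    256  100.0%  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
UN  10.0.0.1  416.13 MB  256  100.0%  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1'

# Lines beginning with DN are down nodes; run the row check only once
# this count reaches 0.
down_count=$(printf '%s\n' "$status_output" | grep -c '^DN')
echo "down nodes: $down_count"
```

In the test loop this would sleep and re-poll until the count is 0 before
running the final row check.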

Regards,


On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater <be...@instaclustr.com>
wrote:

> OK, that’s a bit more unexpected (to me at least) but I think the solution
> of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:
>
>> Thanks Ben, Jeff
>>
>> Sorry that my explanation confused you.
>>
>> Only node1 is the seed node.
>> Node2 whose C* data is deleted is NOT a seed.
>>
>> I restarted the failure node(node2) after restarting the seed node(node1).
>> The restarting node2 succeeded without the exception.
>> (I couldn't restart node2 before restarting node1 as expected.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
>> wrote:
>>
>> The unstated "problem" here is that node1 is a seed, which implies
>> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
>> setup to start without bootstrapping).
>>
>> That means once the data dir is wiped, it's going to start again without
>> a bootstrap, and make a single node cluster or join an existing cluster if
>> the seed list is valid
>>
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> OK, sorry - I think I understand what you are asking now.
>>
>> However, I’m still a little confused by your description. I think your
>> scenario is:
>> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
>> 2) Delete all data from Node A
>> 3) Restart Node A
>> 4) Restart Node B,C
>>
>> Is this correct?
>>
>> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node
>> A starts successfully as there are no running nodes to tell it via gossip
>> that it shouldn’t start up without the “replaces” flag.
>>
>> I think the right way to recover in this scenario is to run a nodetool
>> rebuild on Node A after the other two nodes are running. You could
>> theoretically also run a repair (which would be good practice after a weird
>> failure scenario like this) but rebuild will probably be quicker given you
>> know all the data needs to be re-streamed.
>>
>> Cheers
>> Ben
>>
>> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Thank you Ben, Yabin
>>
>> I understood the rejoin was illegal.
>> I expected this rejoin would fail with the exception.
>> But I could add the failure node to the cluster without the
>> exception after 2) and 3).
>> I want to know why the rejoin succeeds. Should the exception happen?
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>>
>> The exception you run into is expected behavior. This is because as Ben
>> pointed out, when you delete everything (including system schemas), C*
>> cluster thinks you're bootstrapping a new node. However,  node2's IP is
>> still in gossip and this is why you see the exception.
>>
>> I'm not clear on the reasoning why you need to delete the C* data
>> directory. That is a dangerous action, especially considering that you
>> delete system schemas. If in any case the failure node is gone for a
>> while, what you need to do is to remove the node first before doing
>> "rejoin".
>>
>> Cheers,
>>
>> Yabin
>>
>> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> To cassandra, the node where you deleted the files looks like a brand new
>> machine. It doesn’t automatically rebuild machines to prevent accidental
>> replacement. You need to tell it to build the “new” machines as a
>> replacement for the “old” machine with that IP by setting
>> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
>> http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>>
>> Cheers
>> Ben
>>
>> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Hi all,
>>
>> A failure node can rejoin a cluster.
>> On the node, all data in /var/lib/cassandra were deleted.
>> Is it normal?
>>
>> I can reproduce it as below.
>>
>> cluster:
>> - C* 2.2.7
>> - a cluster has node1, 2, 3
>> - node1 is a seed
>> - replication_factor: 3
>>
>> how to:
>> 1) stop C* process and delete all data in /var/lib/cassandra on node2
>> ($sudo rm -rf /var/lib/cassandra/*)
>> 2) stop C* process on node1 and node3
>> 3) restart C* on node1
>> 4) restart C* on node2
>>
>> nodetool status after 4):
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load       Tokens       Owns (effective)  Host ID
>>                           Rack
>> DN  [node3 IP]  ?                 256          100.0%
>>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
>> UN  [node2 IP]  7.76 MB      256          100.0%
>>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>> UN  [node1 IP]  416.13 MB  256          100.0%
>>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>
>> If I restart C* on node 2 when C* on node1 and node3 are running (without
>> 2), 3)), a runtime exception happens.
>> RuntimeException: "A node with address [node2 IP] already exists,
>> cancelling join..."
>>
>> I'm not sure whether this causes data loss. All data can be read properly
>> just after this rejoin.
>> But some rows are lost when I kill & restart C* for destructive tests
>> after this rejoin.
>>
>> Thanks.
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>>
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>>
>>
>> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
OK, that’s a bit more unexpected (to me at least) but I think the solution
of running a rebuild or repair still applies.
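
A recovery sequence along those lines could be sketched as below. This is a
hypothetical sketch, not a tested procedure: the DRYRUN guard only prints
the commands so the script can be exercised without a live cluster, and
"datacenter1" is the DC name from the nodetool status output in the thread.

```shell
# DRYRUN=1 (the default) prints commands instead of executing them.
DRYRUN=${DRYRUN:-1}
run() {
  if [ "$DRYRUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

# Re-stream all data to the wiped node (run ON that node, once the other
# replicas are up again).
run nodetool rebuild -- datacenter1
# Optionally follow with a repair as a consistency check after the failure.
run nodetool repair
```

Setting DRYRUN=0 on the wiped node would execute the commands for real.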

On Tue, 18 Oct 2016 at 15:45 Yuji Ito <yu...@imagine-orb.com> wrote:

> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.
>
> Only node1 is the seed node.
> Node2 whose C* data is deleted is NOT a seed.
>
> I restarted the failure node(node2) after restarting the seed node(node1).
> The restarting node2 succeeded without the exception.
> (I couldn't restart node2 before restarting node1 as expected.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
> wrote:
>
> The unstated "problem" here is that node1 is a seed, which implies
> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
> setup to start without bootstrapping).
>
> That means once the data dir is wiped, it's going to start again without a
> bootstrap, and make a single node cluster or join an existing cluster if
> the seed list is valid
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, sorry - I think I understand what you are asking now.
>
> However, I’m still a little confused by your description. I think your
> scenario is:
> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
> 2) Delete all data from Node A
> 3) Restart Node A
> 4) Restart Node B,C
>
> Is this correct?
>
> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node A
> starts successfully as there are no running nodes to tell it via gossip that
> it shouldn’t start up without the “replaces” flag.
>
> I think the right way to recover in this scenario is to run a nodetool
> rebuild on Node A after the other two nodes are running. You could
> theoretically also run a repair (which would be good practice after a weird
> failure scenario like this) but rebuild will probably be quicker given you
> know all the data needs to be re-streamed.
>
> Cheers
> Ben
>
> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Thank you Ben, Yabin
>
> I understood the rejoin was illegal.
> I expected this rejoin would fail with the exception.
> But I could add the failure node to the cluster without the
> exception after 2) and 3).
> I want to know why the rejoin succeeds. Should the exception happen?
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>
> The exception you run into is expected behavior. This is because as Ben
> pointed out, when you delete everything (including system schemas), C*
> cluster thinks you're bootstrapping a new node. However,  node2's IP is
> still in gossip and this is why you see the exception.
>
> I'm not clear on the reasoning why you need to delete the C* data
> directory. That is a dangerous action, especially considering that you
> delete system schemas. If in any case the failure node is gone for a while,
> what you need to do is to remove the node first before doing "rejoin".
>
> Cheers,
>
> Yabin
>
> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> To cassandra, the node where you deleted the files looks like a brand new
> machine. It doesn’t automatically rebuild machines to prevent accidental
> replacement. You need to tell it to build the “new” machines as a
> replacement for the “old” machine with that IP by setting -Dcassandra.replace_address_first_boot=<dead_node_ip>.
> See http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>
> Cheers
> Ben
>
> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Hi all,
>
> A failure node can rejoin a cluster.
> On the node, all data in /var/lib/cassandra were deleted.
> Is it normal?
>
> I can reproduce it as below.
>
> cluster:
> - C* 2.2.7
> - a cluster has node1, 2, 3
> - node1 is a seed
> - replication_factor: 3
>
> how to:
> 1) stop C* process and delete all data in /var/lib/cassandra on node2
> ($sudo rm -rf /var/lib/cassandra/*)
> 2) stop C* process on node1 and node3
> 3) restart C* on node1
> 4) restart C* on node2
>
> nodetool status after 4):
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens       Owns (effective)  Host ID
>                         Rack
> DN  [node3 IP]  ?                 256          100.0%
>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
> UN  [node2 IP]  7.76 MB      256          100.0%
>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
> UN  [node1 IP]  416.13 MB  256          100.0%
>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>
> If I restart C* on node 2 when C* on node1 and node3 are running (without
> 2), 3)), a runtime exception happens.
> RuntimeException: "A node with address [node2 IP] already exists,
> cancelling join..."
>
> I'm not sure whether this causes data loss. All data can be read properly
> just after this rejoin.
> But some rows are lost when I kill & restart C* for destructive tests
> after this rejoin.
>
> Thanks.
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
>
> --
————————
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
Thanks Ben, Jeff

Sorry that my explanation confused you.

Only node1 is the seed node.
Node2 whose C* data is deleted is NOT a seed.

I restarted the failure node (node2) after restarting the seed node (node1).
Restarting node2 succeeded without the exception.
(As expected, I couldn't restart node2 before restarting node1.)

Regards,


On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa <je...@crowdstrike.com>
wrote:

> The unstated "problem" here is that node1 is a seed, which implies
> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
> setup to start without bootstrapping).
>
> That means once the data dir is wiped, it's going to start again without a
> bootstrap, and make a single node cluster or join an existing cluster if
> the seed list is valid
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> OK, sorry - I think I understand what you are asking now.
>
> However, I’m still a little confused by your description. I think your
> scenario is:
> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
> 2) Delete all data from Node A
> 3) Restart Node A
> 4) Restart Node B,C
>
> Is this correct?
>
> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node A
> starts successfully as there are no running nodes to tell it via gossip that
> it shouldn’t start up without the “replaces” flag.
>
> I think the right way to recover in this scenario is to run a nodetool
> rebuild on Node A after the other two nodes are running. You could
> theoretically also run a repair (which would be good practice after a weird
> failure scenario like this) but rebuild will probably be quicker given you
> know all the data needs to be re-streamed.
>
> Cheers
> Ben
>
> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>
>> Thank you Ben, Yabin
>>
>> I understood the rejoin was illegal.
>> I expected this rejoin would fail with the exception.
>> But I could add the failure node to the cluster without the
>> exception after 2) and 3).
>> I want to know why the rejoin succeeds. Should the exception happen?
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>>
>> The exception you run into is expected behavior. This is because as Ben
>> pointed out, when you delete everything (including system schemas), C*
>> cluster thinks you're bootstrapping a new node. However,  node2's IP is
>> still in gossip and this is why you see the exception.
>>
>> I'm not clear on the reasoning why you need to delete the C* data
>> directory. That is a dangerous action, especially considering that you
>> delete system schemas. If in any case the failure node is gone for a
>> while, what you need to do is to remove the node first before doing
>> "rejoin".
>>
>> Cheers,
>>
>> Yabin
>>
>> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
>> wrote:
>>
>> To cassandra, the node where you deleted the files looks like a brand new
>> machine. It doesn’t automatically rebuild machines to prevent accidental
>> replacement. You need to tell it to build the “new” machines as a
>> replacement for the “old” machine with that IP by setting
>> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
>> http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>>
>> Cheers
>> Ben
>>
>> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>> Hi all,
>>
>> A failure node can rejoin a cluster.
>> On the node, all data in /var/lib/cassandra were deleted.
>> Is it normal?
>>
>> I can reproduce it as below.
>>
>> cluster:
>> - C* 2.2.7
>> - a cluster has node1, 2, 3
>> - node1 is a seed
>> - replication_factor: 3
>>
>> how to:
>> 1) stop C* process and delete all data in /var/lib/cassandra on node2
>> ($sudo rm -rf /var/lib/cassandra/*)
>> 2) stop C* process on node1 and node3
>> 3) restart C* on node1
>> 4) restart C* on node2
>>
>> nodetool status after 4):
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load       Tokens       Owns (effective)  Host ID
>>                           Rack
>> DN  [node3 IP]  ?                 256          100.0%
>>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
>> UN  [node2 IP]  7.76 MB      256          100.0%
>>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>> UN  [node1 IP]  416.13 MB  256          100.0%
>>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>
>> If I restart C* on node 2 when C* on node1 and node3 are running (without
>> 2), 3)), a runtime exception happens.
>> RuntimeException: "A node with address [node2 IP] already exists,
>> cancelling join..."
>>
>> I'm not sure whether this causes data loss. All data can be read properly
>> just after this rejoin.
>> But some rows are lost when I kill & restart C* for destructive tests
>> after this rejoin.
>>
>> Thanks.
>>
>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>>
>>
>> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>

Re: failure node rejoin

Posted by Jeff Jirsa <je...@crowdstrike.com>.
The unstated "problem" here is that node1 is a seed, which implies auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly setup to start without bootstrapping).

That means once the data dir is wiped, it's going to start again without a bootstrap, and make a single node cluster or join an existing cluster if the seed list is valid
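
For reference, the replace flow Ben mentioned is driven by a JVM flag; a
minimal sketch of how it is typically set, in cassandra-env.sh on the wiped
node before its first restart (the IP below is a placeholder for the dead
node's address):

```shell
# cassandra-env.sh on the replacement node, first boot only. Remove this
# line again once the node has finished joining. 10.0.0.2 is a placeholder
# for the dead node's IP.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.2"
```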



-- 
Jeff Jirsa


> On Oct 17, 2016, at 8:51 PM, Ben Slater <be...@instaclustr.com> wrote:
> 
> OK, sorry - I think I understand what you are asking now.
> 
> However, I’m still a little confused by your description. I think your scenario is:
> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
> 2) Delete all data from Node A
> 3) Restart Node A
> 4) Restart Node B,C
> 
> Is this correct?
> 
> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node A starts successfully as there are no running nodes to tell it via gossip that it shouldn’t start up without the “replaces” flag.
> 
> I think the right way to recover in this scenario is to run a nodetool rebuild on Node A after the other two nodes are running. You could theoretically also run a repair (which would be good practice after a weird failure scenario like this) but rebuild will probably be quicker given you know all the data needs to be re-streamed.
> 
> Cheers
> Ben
> 
>> On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:
>> Thank you Ben, Yabin
>> 
>> I understood the rejoin was illegal.
>> I expected this rejoin would fail with the exception.
>> But I could add the failure node to the cluster without the exception after 2) and 3).
>> I want to know why the rejoin succeeds. Should the exception happen?
>> 
>> Regards,
>> 
>> 
>> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>> The exception you run into is expected behavior. This is because as Ben pointed out, when you delete everything (including system schemas), C* cluster thinks you're bootstrapping a new node. However,  node2's IP is still in gossip and this is why you see the exception.
>> 
>> I'm not clear on the reasoning why you need to delete the C* data directory. That is a dangerous action, especially considering that you delete system schemas. If in any case the failure node is gone for a while, what you need to do is to remove the node first before doing "rejoin".
>> 
>> Cheers,
>> 
>> Yabin
>> 
>> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com> wrote:
>> To cassandra, the node where you deleted the files looks like a brand new machine. It doesn’t automatically rebuild machines to prevent accidental replacement. You need to tell it to build the “new” machines as a replacement for the “old” machine with that IP by setting -Dcassandra.replace_address_first_boot=<dead_node_ip>. See http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>> 
>> Cheers
>> Ben
>> 
>> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>> Hi all,
>> 
>> A failure node can rejoin a cluster.
>> On the node, all data in /var/lib/cassandra were deleted.
>> Is it normal?
>> 
>> I can reproduce it as below.
>> 
>> cluster:
>> - C* 2.2.7
>> - a cluster has node1, 2, 3
>> - node1 is a seed
>> - replication_factor: 3
>> 
>> how to:
>> 1) stop C* process and delete all data in /var/lib/cassandra on node2 ($sudo rm -rf /var/lib/cassandra/*)
>> 2) stop C* process on node1 and node3
>> 3) restart C* on node1
>> 4) restart C* on node2
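The four numbered steps above can be sketched as a dry-run script. The hostnames and the `service` command are assumptions for illustration; the script only echoes each command rather than running it:

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps; nothing destructive is executed.
# Replace 'echo' with e.g. 'ssh' to actually run each command on its node.
run() { echo "$@"; }

run node2 "sudo service cassandra stop; sudo rm -rf /var/lib/cassandra/*"  # step 1
run node1 "sudo service cassandra stop"                                    # step 2
run node3 "sudo service cassandra stop"                                    # step 2
run node1 "sudo service cassandra start"                                   # step 3
run node2 "sudo service cassandra start"                                   # step 4
```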
>> 
>> nodetool status after 4):
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
>> DN  [node3 IP]  ?                 256          100.0%            325553c6-3e05-41f6-a1f7-47436743816f  rack1
>> UN  [node2 IP]  7.76 MB      256          100.0%            05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>> UN  [node1 IP]  416.13 MB  256          100.0%            a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>> 
>> If I restart C* on node 2 when C* on node1 and node3 are running (without 2), 3)), a runtime exception happens.
>> RuntimeException: "A node with address [node2 IP] already exists, cancelling join..."
>> 
>> I'm not sure this causes data loss. All data can be read properly just after this rejoin.
>> But some rows are lost when I kill&restart C* for destructive tests after this rejoin.
>> 
>> Thanks.
>> 
>> -- 
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>> 
>> 
> 
> -- 
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798


Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
OK, sorry - I think I understand what you are asking now.

However, I’m still a little confused by your description. I think your
scenario is:
1) Stop C* on all nodes in a cluster (Nodes A,B,C)
2) Delete all data from Node A
3) Restart Node A
4) Restart Node B,C

Is this correct?

If so, this isn’t a scenario I’ve tested/seen, but I’m not surprised Node A
starts successfully, as there are no running nodes to tell it via gossip that
it shouldn’t start up without the “replaces” flag.

I think the right way to recover in this scenario is to run a nodetool
rebuild on Node A after the other two nodes are running. You could
theoretically also run a repair (which would be good practice after a weird
failure scenario like this) but rebuild will probably be quicker given you
know all the data needs to be re-streamed.
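A dry-run sketch of the two recovery paths discussed in this thread (names are illustrative and nodetool itself is not invoked; in the scenario above, where Node A came back empty with no live peers, the rebuild branch applies):

```shell
#!/bin/sh
# Sketch: print the recovery command for each situation discussed.
# Nothing is executed; nodetool/cassandra invocations are only echoed.
recovery_cmd() {
  case "$1" in
    # Node started empty while the whole cluster was down (this thread's case):
    fresh_restart) echo "nodetool rebuild" ;;
    # Node is being replaced while the rest of the cluster is up:
    replacement)   echo "cassandra -Dcassandra.replace_address_first_boot=<dead_node_ip>" ;;
  esac
}

recovery_cmd fresh_restart
```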

Cheers
Ben

On Tue, 18 Oct 2016 at 14:03 Yuji Ito <yu...@imagine-orb.com> wrote:

> Thank you Ben, Yabin
>
> I understood the rejoin was illegal.
> I expected this rejoin would fail with the exception.
> But I could add the failure node to the cluster without the
> exception after 2) and 3).
> I want to know why the rejoin succeeds. Should the exception happen?
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:
>
> The exception you run into is expected behavior. This is because as Ben
> pointed out, when you delete everything (including system schemas), C*
> cluster thinks you're bootstrapping a new node. However,  node2's IP is
> still in gossip and this is why you see the exception.
>
> I'm not clear on why you need to delete the C* data directory. That
> is a dangerous action, especially considering that you delete the system
> schemas. In any case, if the failed node has been gone for a while, what you
> need to do is remove the node first before doing the "rejoin".
>
> Cheers,
>
> Yabin
>
> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
> To Cassandra, the node where you deleted the files looks like a brand new
> machine. It doesn’t automatically rebuild machines, to prevent accidental
> replacement. You need to tell it to build the “new” machine as a
> replacement for the “old” machine with that IP by setting -Dcassandra.replace_address_first_boot=<dead_node_ip>.
> See http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>
> Cheers
> Ben
>
> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>
> Hi all,
>
> A failure node can rejoin a cluster.
> On the node, all data in /var/lib/cassandra were deleted.
> Is it normal?
>
> I can reproduce it as below.
>
> cluster:
> - C* 2.2.7
> - a cluster has node1, 2, 3
> - node1 is a seed
> - replication_factor: 3
>
> how to:
> 1) stop C* process and delete all data in /var/lib/cassandra on node2
> ($sudo rm -rf /var/lib/cassandra/*)
> 2) stop C* process on node1 and node3
> 3) restart C* on node1
> 4) restart C* on node2
>
> nodetool status after 4):
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens       Owns (effective)  Host ID
>                         Rack
> DN  [node3 IP]  ?                 256          100.0%
>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
> UN  [node2 IP]  7.76 MB      256          100.0%
>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
> UN  [node1 IP]  416.13 MB  256          100.0%
>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>
> If I restart C* on node 2 when C* on node1 and node3 are running (without
> 2), 3)), a runtime exception happens.
> RuntimeException: "A node with address [node2 IP] already exists,
> cancelling join..."
>
> I'm not sure this causes data loss. All data can be read properly just
> after this rejoin.
> But some rows are lost when I kill&restart C* for destructive tests after
> this rejoin.
>
> Thanks.
>
> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>
>
>
> --
————————
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798

Re: failure node rejoin

Posted by Yuji Ito <yu...@imagine-orb.com>.
Thank you Ben, Yabin

I understood the rejoin was illegal.
I expected this rejoin would fail with the exception.
But I could add the failure node to the cluster without the exception after
2) and 3).
I want to know why the rejoin succeeds. Should the exception happen?

Regards,


On Tue, Oct 18, 2016 at 1:51 AM, Yabin Meng <ya...@gmail.com> wrote:

> The exception you run into is expected behavior. This is because as Ben
> pointed out, when you delete everything (including system schemas), C*
> cluster thinks you're bootstrapping a new node. However,  node2's IP is
> still in gossip and this is why you see the exception.
>
> I'm not clear on why you need to delete the C* data directory. That
> is a dangerous action, especially considering that you delete the system
> schemas. In any case, if the failed node has been gone for a while, what you
> need to do is remove the node first before doing the "rejoin".
>
> Cheers,
>
> Yabin
>
> On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
> wrote:
>
>> To Cassandra, the node where you deleted the files looks like a brand new
>> machine. It doesn’t automatically rebuild machines, to prevent accidental
>> replacement. You need to tell it to build the “new” machine as a
>> replacement for the “old” machine with that IP by setting
>> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
>> http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>>
>> Cheers
>> Ben
>>
>> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>>
>>> Hi all,
>>>
>>> A failure node can rejoin a cluster.
>>> On the node, all data in /var/lib/cassandra were deleted.
>>> Is it normal?
>>>
>>> I can reproduce it as below.
>>>
>>> cluster:
>>> - C* 2.2.7
>>> - a cluster has node1, 2, 3
>>> - node1 is a seed
>>> - replication_factor: 3
>>>
>>> how to:
>>> 1) stop C* process and delete all data in /var/lib/cassandra on node2
>>> ($sudo rm -rf /var/lib/cassandra/*)
>>> 2) stop C* process on node1 and node3
>>> 3) restart C* on node1
>>> 4) restart C* on node2
>>>
>>> nodetool status after 4):
>>> Datacenter: datacenter1
>>> =======================
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address        Load       Tokens       Owns (effective)  Host ID
>>>                           Rack
>>> DN  [node3 IP]  ?                 256          100.0%
>>>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
>>> UN  [node2 IP]  7.76 MB      256          100.0%
>>>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>>> UN  [node1 IP]  416.13 MB  256          100.0%
>>>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>>
>>> If I restart C* on node 2 when C* on node1 and node3 are running
>>> (without 2), 3)), a runtime exception happens.
>>> RuntimeException: "A node with address [node2 IP] already exists,
>>> cancelling join..."
>>>
>>> I'm not sure this causes data loss. All data can be read properly just
>>> after this rejoin.
>>> But some rows are lost when I kill&restart C* for destructive tests
>>> after this rejoin.
>>>
>>> Thanks.
>>>
>>> --
>> ————————
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>
>

Re: failure node rejoin

Posted by Yabin Meng <ya...@gmail.com>.
The exception you run into is expected behavior. This is because as Ben
pointed out, when you delete everything (including system schemas), C*
cluster thinks you're bootstrapping a new node. However,  node2's IP is
still in gossip and this is why you see the exception.

I'm not clear on why you need to delete the C* data directory. That
is a dangerous action, especially considering that you delete the system
schemas. In any case, if the failed node has been gone for a while, what you
need to do is remove the node first before doing the "rejoin".
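As a sketch of that order of operations, using the Host ID of the down (DN) node from the nodetool status output quoted below (echoed as a dry run; the real command is run from any live node):

```shell
#!/bin/sh
# Sketch: remove the dead node by Host ID before re-adding it as a new node.
# The UUID is the DN node's Host ID from the nodetool status output in this thread.
DEAD_HOST_ID="325553c6-3e05-41f6-a1f7-47436743816f"
echo "nodetool removenode ${DEAD_HOST_ID}"   # dry run; drop the echo to execute
```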

Cheers,

Yabin

On Mon, Oct 17, 2016 at 1:48 AM, Ben Slater <be...@instaclustr.com>
wrote:

> To Cassandra, the node where you deleted the files looks like a brand new
> machine. It doesn’t automatically rebuild machines, to prevent accidental
> replacement. You need to tell it to build the “new” machine as a
> replacement for the “old” machine with that IP by setting
> -Dcassandra.replace_address_first_boot=<dead_node_ip>. See
> http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
>
> Cheers
> Ben
>
> On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:
>
>> Hi all,
>>
>> A failure node can rejoin a cluster.
>> On the node, all data in /var/lib/cassandra were deleted.
>> Is it normal?
>>
>> I can reproduce it as below.
>>
>> cluster:
>> - C* 2.2.7
>> - a cluster has node1, 2, 3
>> - node1 is a seed
>> - replication_factor: 3
>>
>> how to:
>> 1) stop C* process and delete all data in /var/lib/cassandra on node2
>> ($sudo rm -rf /var/lib/cassandra/*)
>> 2) stop C* process on node1 and node3
>> 3) restart C* on node1
>> 4) restart C* on node2
>>
>> nodetool status after 4):
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load       Tokens       Owns (effective)  Host ID
>>                           Rack
>> DN  [node3 IP]  ?                 256          100.0%
>>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
>> UN  [node2 IP]  7.76 MB      256          100.0%
>>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
>> UN  [node1 IP]  416.13 MB  256          100.0%
>>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>>
>> If I restart C* on node 2 when C* on node1 and node3 are running (without
>> 2), 3)), a runtime exception happens.
>> RuntimeException: "A node with address [node2 IP] already exists,
>> cancelling join..."
>>
>> I'm not sure this causes data loss. All data can be read properly just
>> after this rejoin.
>> But some rows are lost when I kill&restart C* for destructive tests after
>> this rejoin.
>>
>> Thanks.
>>
>> --
> ————————
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>

Re: failure node rejoin

Posted by Ben Slater <be...@instaclustr.com>.
To Cassandra, the node where you deleted the files looks like a brand new
machine. It doesn’t automatically rebuild machines, to prevent accidental
replacement. You need to tell it to build the “new” machine as a
replacement for the “old” machine with that IP by setting
-Dcassandra.replace_address_first_boot=<dead_node_ip>.
See http://cassandra.apache.org/doc/latest/operating/topo_changes.html.
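A minimal sketch of wiring up that flag (the IP is a placeholder; in practice the option usually goes into conf/cassandra-env.sh on the replacement node):

```shell
#!/bin/sh
# Sketch: build the JVM option for replacing a dead node by its former IP.
# 192.0.2.10 is a placeholder; use the failed node's old address.
DEAD_NODE_IP="192.0.2.10"
REPLACE_OPT="-Dcassandra.replace_address_first_boot=${DEAD_NODE_IP}"
# Line to append to conf/cassandra-env.sh (printed here, not written):
echo "JVM_OPTS=\"\$JVM_OPTS ${REPLACE_OPT}\""
```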

Cheers
Ben

On Mon, 17 Oct 2016 at 16:41 Yuji Ito <yu...@imagine-orb.com> wrote:

> Hi all,
>
> A failure node can rejoin a cluster.
> On the node, all data in /var/lib/cassandra were deleted.
> Is it normal?
>
> I can reproduce it as below.
>
> cluster:
> - C* 2.2.7
> - a cluster has node1, 2, 3
> - node1 is a seed
> - replication_factor: 3
>
> how to:
> 1) stop C* process and delete all data in /var/lib/cassandra on node2
> ($sudo rm -rf /var/lib/cassandra/*)
> 2) stop C* process on node1 and node3
> 3) restart C* on node1
> 4) restart C* on node2
>
> nodetool status after 4):
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens       Owns (effective)  Host ID
>                         Rack
> DN  [node3 IP]  ?                 256          100.0%
>  325553c6-3e05-41f6-a1f7-47436743816f  rack1
> UN  [node2 IP]  7.76 MB      256          100.0%
>  05bdb1d4-c39b-48f1-8248-911d61935925  rack1
> UN  [node1 IP]  416.13 MB  256          100.0%
>  a8ec0a31-cb92-44b0-b156-5bcd4f6f2c7b  rack1
>
> If I restart C* on node 2 when C* on node1 and node3 are running (without
> 2), 3)), a runtime exception happens.
> RuntimeException: "A node with address [node2 IP] already exists,
> cancelling join..."
>
> I'm not sure this causes data loss. All data can be read properly just
> after this rejoin.
> But some rows are lost when I kill&restart C* for destructive tests after
> this rejoin.
>
> Thanks.
>
> --
————————
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798