
Posted to user@cassandra.apache.org by mcasandra <mo...@gmail.com> on 2011/03/03 18:39:32 UTC

Error when bringing up nodes during failure testing

Whenever I do failure testing I see this error message, and then the Cassandra
process exits. This is what I am doing:


1. 3-node cluster. CF with RF=3, W=QUORUM and R=QUORUM
2. Execute client code that just reads data from the CF in a while loop.
3. Bring one node down (Node C). Everything is OK. Client is happy (as
expected)
4. Bring one more node down (Node A). Client throws an error (as expected)
5. Bring one node up, and then I receive the following error message in Cassandra,
and Cassandra exits at this point.

Please help. But sometimes when I bring some other node up first (Node C)
and then bring up this node (A), it works. Not sure what's going on here.
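
For reference, here is a minimal sketch of the quorum arithmetic behind the two client observations above (reads work with one node down and fail with two). The class and method names are made up for illustration; the only assumption is that QUORUM means a majority of the replicas, i.e. floor(RF / 2) + 1.

```java
class QuorumSketch {
    // QUORUM is a majority of the replicas: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // A QUORUM request can only be served if enough replicas are alive.
    static boolean quorumAvailable(int replicationFactor, int liveReplicas) {
        return liveReplicas >= quorum(replicationFactor);
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println("quorum for RF=3: " + quorum(rf));
        System.out.println("one node down, quorum reachable:  " + quorumAvailable(rf, 2));
        System.out.println("two nodes down, quorum reachable: " + quorumAvailable(rf, 1));
    }
}
```

With RF=3 the quorum is 2, so two live replicas are enough and one is not. This only covers the client-side errors; it does not explain the startup exception below.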

Error in Cassandra logs:

ERROR 15:36:55,153 Exception encountered during startup.
java.lang.IllegalStateException: replication factor (3) exceeds number of endpoints (2)
        at org.apache.cassandra.locator.SimpleStrategy.calculateNaturalEndpoints(SimpleStrategy.java:60)
        at org.apache.cassandra.locator.AbstractReplicationStrategy.getRangeAddresses(AbstractReplicationStrategy.java:204)


--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Error-when-bringing-up-nodes-during-failure-testing-tp6085692p6085692.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: Exception when bringing up nodes during failure testing

Posted by mcasandra <mo...@gmail.com>.

Is this a bug or something I am doing wrong? Can't get past this now.


Re: Error when bringing up nodes during failure testing

Posted by mcasandra <mo...@gmail.com>.

aaron morton wrote:
> 
> Can you include the full error stack ? 
> 
> It's failing because of the reason stated. But I need some more info to
> understand what part of the startup process it's stuck at. 
> 
> 
Thanks for responding! I'll send it as soon as I can get on my network. But
you mentioned that someone already stated the reason but I can't find it in
this thread. Did I miss it?


Re: Error when bringing up nodes during failure testing

Posted by Jonathan Ellis <jb...@gmail.com>.

Is he trying to bootstrap?  What does that have to do with failure
recovery?  Doesn't make sense to me.

On Tue, Mar 8, 2011 at 2:33 AM, aaron morton <aa...@thelastpickle.com> wrote:
> It looks like the node is sending out its application state and waiting the required time, after which it expects to know about all the other nodes in the cluster.
>
>> INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining: sleeping 30000 ms for pending range setup
>
> For some reason it cannot see them. This could be a config thing or a networking thing.
>
> I was a bit off in my analysis before. When bootstrapping, it's smart enough to wait for gossip to kick in and tell the node about the others in the cluster.
>
> Try the following:
> - check network connectivity between the problem node and the others, and check they have the same config
> - try to bring up the problem node with auto_bootstrap off. If it can start, check its view of the cluster with nodetool ring
> - if that fails, turn on TRACE logging on all nodes and try to bring up the problem node. This will log a lot of messages about what Gossip is doing.
>
> Aaron
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Error when bringing up nodes during failure testing

Posted by Peter Schuller <pe...@infidyne.com>.

> Also, what are the disadvantages of turning off auto bootstrap? Do I need to
> do anything after the fact?

Inserting a new node into a ring without auto_bootstrap implies that
it will join the ring, but will not contain any of the data for which it is
supposedly responsible. A 'nodetool repair' should cause the data to be
replicated to it. But until that's done, expect the node to return
inconsistent results.

So, turning off auto_bootstrap probably just hid/changed the symptom
of the problem you're seeing rather than fixing it.

-- 
/ Peter Schuller

Re: Error when bringing up nodes during failure testing

Posted by mcasandra <mo...@gmail.com>.

I turned auto_bootstrap off and it worked fine. I don't think it's a
connectivity or network issue at all. I am very confused about what's
going on here. Can you please let me know if this is a bug that I am facing?


Also, what are the disadvantages of turning off auto bootstrap? Do I need to
do anything after the fact?

I don't see any nodetool join option in nodetool as stated previously.


Re: Error when bringing up nodes during failure testing

Posted by aaron morton <aa...@thelastpickle.com>.

It looks like the node is sending out its application state and waiting the required time, after which it expects to know about all the other nodes in the cluster.

> INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining: sleeping 30000 ms for pending range setup

For some reason it cannot see them. This could be a config thing or a networking thing.

I was a bit off in my analysis before. When bootstrapping, it's smart enough to wait for gossip to kick in and tell the node about the others in the cluster.

Try the following:
- check network connectivity between the problem node and the others, and check they have the same config
- try to bring up the problem node with auto_bootstrap off. If it can start, check its view of the cluster with nodetool ring
- if that fails, turn on TRACE logging on all nodes and try to bring up the problem node. This will log a lot of messages about what Gossip is doing.
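
As a rough aid for the first check in the list above, here is a small sketch that tests whether a TCP connection can be opened from the problem node to each peer. It is not part of Cassandra; the class and method names are made up, and port 7000 is only the assumed default storage_port, so check the value in your own config. A successful connect shows the port is reachable, not that gossip is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PeerReachabilitySketch {
    // Returns true if a TCP connection to host:port can be opened within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat all as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        int storagePort = 7000; // assumption: default storage_port, verify in your config
        for (String peer : args) {
            System.out.println(peer + ":" + storagePort
                    + " reachable=" + canConnect(peer, storagePort, 2000));
        }
    }
}
```

Run it on the problem node with the addresses of the other nodes as arguments, for example 181.116.206.179 and 181.116.208.68 from the logs.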
 
Aaron


Re: Error when bringing up nodes during failure testing

Posted by mcasandra <mo...@gmail.com>.

aaron morton wrote:
> 
> 2) um, not sure. The nodetool output below looks like there are only 2
> nodes in that cluster, i.e. there are no down nodes. 
> 
There are actually 3 nodes. Not sure why the output is not showing the other
node, which is currently down. The error I am getting is from the
3rd node that is currently down.

Here are the logs, which show it tried to talk to the other 2 nodes:

---
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java (line 248) Finished hinted handoff of 0 rows to endpoint /181.116.206.179
 INFO [GossipStage:1] 2011-03-07 17:02:36,463 StorageService.java (line 606) Node /181.116.208.68 state jump to normal
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,463 HintedHandOffManager.java (line 192) Started hinted handoff for endpoint /181.116.208.68
 INFO [HintedHandoff:1] 2011-03-07 17:02:36,464 HintedHandOffManager.java (line 248) Finished hinted handoff of 0 rows to endpoint /181.116.208.68
 INFO [main] 2011-03-07 17:04:06,424 StorageService.java (line 399) Joining: getting bootstrap token
 INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 648) switching in a fresh Memtable for LocationInfo at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1299546155643.log', position=296)
 INFO [main] 2011-03-07 17:04:06,426 ColumnFamilyStore.java (line 952) Enqueuing flush of Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
 INFO [FlushWriter:1] 2011-03-07 17:04:06,427 Memtable.java (line 155) Writing Memtable-LocationInfo@1367996500(36 bytes, 1 operations)
 INFO [FlushWriter:1] 2011-03-07 17:04:06,659 Memtable.java (line 162) Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-80-Data.db (156 bytes)
 INFO [CompactionExecutor:1] 2011-03-07 17:04:06,660 CompactionManager.java (line 272) Compacting [org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-77-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-78-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-79-Data.db'),org.apache.cassandra.io.sstable.SSTableReader(path='/var/lib/cassandra/data/system/LocationInfo-e-80-Data.db')]
 INFO [main] 2011-03-07 17:04:06,660 StorageService.java (line 399) Joining: sleeping 30000 ms for pending range setup
 INFO [CompactionExecutor:1] 2011-03-07 17:04:06,849 CompactionManager.java (line 354) Compacted to /var/lib/cassandra/data/system/LocationInfo-tmp-e-81-Data.db.  1,293 to 832 (~64% of original) bytes for 4 keys.  Time: 185ms.
 INFO [main] 2011-03-07 17:04:36,667 StorageService.java (line 399) Bootstrapping
ERROR [main] 2011-03-07 17:04:36,677 AbstractCassandraDaemon.java (line 234) Exception encountered during startup.
java.lang.IllegalStateException: replication factor (3) exceeds number of endpoints (2)
----



Re: Error when bringing up nodes during failure testing

Posted by aaron morton <aa...@thelastpickle.com>.

1) yes

2) um, not sure. The nodetool output below looks like there are only 2 nodes in that cluster, i.e. there are no down nodes. 

Aaron

On 8/03/2011, at 2:11 PM, mcasandra wrote:

> 
> aaron morton wrote:
>> 
>> It's failing because when the node bootstraps it does not know about
>> enough nodes to support the RF...
>> 
>>> replication factor (3) exceeds number of
>>> endpoints (2)
>> 
>> I *think* the normal work around is to disable autobootstrap, bring the
>> nodes up then run "nodetool join" or StorageService.joinRing() via the
>> JConsole.
>> 
>> I have not tested this, but reading the code, that looks OK. Can you try it out
>> and let me know how it goes?
>> 
>> Aaron
>> 
> 
> I am getting confused about the behaviour:
> 
> 1) Out of 3 nodes I have 2 nodes up and I am trying to start this node
> that's failing. Is this expected that even though there are 2 nodes up one
> node will continuously fail with "replication factor (3) exceeds .."
> message?
> 
> 2) When I brought 2 nodes down (out of 3), I was able to start one node
> (with 66 % load below) even though auto_bootstrap is set to true. Shouldn't
> it have failed for the same reason?
> 
> $ nodetool -h `hostname` ring
> Address          Status State   Load            Owns    Token
>                                                         113427455640312821154458202477256070484
> 181.116.206.179  Up     Normal  645.13 KB       33.33%  0
> 181.116.208.68   Up     Normal  640.16 KB       66.67%  113427455640312821154458202477256070484
> 
> 

Re: Error when bringing up nodes during failure testing

Posted by mcasandra <mo...@gmail.com>.

I am as clear as mud with what is happening here :)

But with some suggestions I can try to start my test from scratch and post
results in that order.


Re: Error when bringing up nodes during failure testing

Posted by Peter Schuller <pe...@infidyne.com>.

> 2) When I brought 2 nodes down (out of 3), I was able to start one node
> (with 66 % load below) even though auto_bootstrap is set to true. Shouldn't
> it have failed for the same reason?

This is a good point/question. As far as I can tell, a node being
bootstrapped would need to receive data from a sufficient number of
replicas to satisfy the maximum consistency level that the
application(s) use, in order to avoid the potential for violating the
consistency requirement expected by clients. Not knowing what the
application expects, that would imply a quorum of nodes.

I just checked the code, and my reading (untested) is that the intent
is to receive data from all nodes responsible for the part of the ring
that is being taken over. Meaning, it satisfies the above requirement.

However, that reading is inconsistent with your test which suggests
you were able to bootstrap with two nodes missing out of three.

Is your nodetool output from the new node or the pre-existing online
node? It only lists two nodes, rather than 3 or 4 (with some being
Down). If the only remaining node doesn't know about the other two
that are down, that may explain it.

I may be mis-reading the code because it's suddenly unclear to me how
this is supposed to work with respect to nodes being down (supposing
it's truly down, forever, and needs to be replaced).

Anyone?

-- 
/ Peter Schuller

Re: Error when bringing up nodes during failure testing

Posted by mcasandra <mo...@gmail.com>.

aaron morton wrote:
> 
> It's failing because when the node bootstraps it does not know about
> enough nodes to support the RF...
> 
>> replication factor (3) exceeds number of
>> endpoints (2)
> 
> I *think* the normal work around is to disable autobootstrap, bring the
> nodes up then run "nodetool join" or StorageService.joinRing() via the
> JConsole.
> 
> I have not tested this, but reading the code, that looks OK. Can you try it out
> and let me know how it goes?
> 
> Aaron
> 

I am getting confused about the behaviour:

1) Out of 3 nodes I have 2 nodes up and I am trying to start this node
that's failing. Is this expected that even though there are 2 nodes up one
node will continuously fail with "replication factor (3) exceeds .."
message?

2) When I brought 2 nodes down (out of 3), I was able to start one node
(with 66 % load below) even though auto_bootstrap is set to true. Shouldn't
it have failed for the same reason?

$ nodetool -h `hostname` ring
Address          Status State   Load            Owns    Token
                                                        113427455640312821154458202477256070484
181.116.206.179  Up     Normal  645.13 KB       33.33%  0
181.116.208.68   Up     Normal  640.16 KB       66.67%  113427455640312821154458202477256070484
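
For anyone reading the ring output above, here is a small sketch of how the Owns column can be derived from the tokens. It assumes a token space of 0 to 2^127 (as with the RandomPartitioner, which this thread does not confirm is in use) and that each node owns the range from the previous token up to its own. The class and method names are illustrative only.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

class RingOwnershipSketch {
    // Assumed token space: 0 .. 2^127.
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    // Fraction of the ring covered by the range (previousToken, token],
    // wrapping around the ring when token <= previousToken.
    static double ownership(BigInteger previousToken, BigInteger token) {
        BigInteger width = token.subtract(previousToken).mod(RING_SIZE);
        if (width.signum() == 0) {
            return 1.0; // a single node owns the whole ring
        }
        return new BigDecimal(width)
                .divide(new BigDecimal(RING_SIZE), MathContext.DECIMAL64)
                .doubleValue();
    }

    public static void main(String[] args) {
        BigInteger t0 = BigInteger.ZERO;
        BigInteger t1 = new BigInteger("113427455640312821154458202477256070484");
        System.out.printf("token 0 owns %.2f%%%n", 100 * ownership(t1, t0));
        System.out.printf("token %s owns %.2f%%%n", t1, 100 * ownership(t0, t1));
    }
}
```

With only these two tokens known, the split comes out at roughly one third and two thirds, which matches a ring view where the third node's token is missing.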



Re: Error when bringing up nodes during failure testing

Posted by aaron morton <aa...@thelastpickle.com>.

It's failing because when the node bootstraps it does not know about enough nodes to support the RF...

> replication factor (3) exceeds number of
> endpoints (2)

I *think* the normal work around is to disable autobootstrap, bring the nodes up then run "nodetool join" or StorageService.joinRing() via the JConsole.

I have not tested this, but reading the code, that looks OK. Can you try it out and let me know how it goes?
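
To illustrate the failure mode, here is a simplified sketch of the kind of check that produces this message. It is not the actual SimpleStrategy source; the class, method, and parameter names are hypothetical. It assumes replicas are picked by walking the known ring from a starting position, so a node whose view of the ring holds fewer endpoints than the RF cannot complete the walk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EndpointCheckSketch {
    // Pick replicationFactor replicas by walking the ring from position start.
    // knownEndpoints is this node's current view of the ring, in token order.
    static List<String> naturalEndpoints(List<String> knownEndpoints, int start, int replicationFactor) {
        if (replicationFactor > knownEndpoints.size()) {
            throw new IllegalStateException("replication factor (" + replicationFactor
                    + ") exceeds number of endpoints (" + knownEndpoints.size() + ")");
        }
        List<String> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            replicas.add(knownEndpoints.get((start + i) % knownEndpoints.size()));
        }
        return replicas;
    }

    public static void main(String[] args) {
        // The bootstrapping node only knows about two endpoints, but RF is 3.
        List<String> view = Arrays.asList("181.116.206.179", "181.116.208.68");
        try {
            naturalEndpoints(view, 0, 3);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that the count is taken from the node's own view of the ring at bootstrap time, not from the number of nodes that actually exist.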

Aaron



Re: Error when bringing up nodes during failure testing

Posted by mcasandra <mo...@gmail.com>.

aaron morton wrote:
> 
> Can you include the full error stack ? 
> 


Please find the complete stack trace. Can't really move forward with it not
knowing the cause:


ERROR [main] 2011-03-02 16:28:23,923 AbstractCassandraDaemon.java (line 234) Exception encountered during startup.
java.lang.IllegalStateException: replication factor (3) exceeds number of endpoints (2)
        at org.apache.cassandra.locator.SimpleStrategy.calculateNaturalEndpoints(SimpleStrategy.java:60)
        at org.apache.cassandra.locator.AbstractReplicationStrategy.getRangeAddresses(AbstractReplicationStrategy.java:204)
        at org.apache.cassandra.dht.BootStrapper.getRangesWithSources(BootStrapper.java:198)
        at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:83)
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:417)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:361)
        at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:161)
        at org.apache.cassandra.thrift.CassandraDaemon.setup(CassandraDaemon.java:55)
        at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:217)
        at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:134)



Re: Error when bringing up nodes during failure testing

Posted by aaron morton <aa...@thelastpickle.com>.

Can you include the full error stack ? 

It's failing because of the reason stated. But I need some more info to understand what part of the startup process it's stuck at. 

Aaron
  