Posted to user@cassandra.apache.org by Philipp Potisk <ph...@geroba.at> on 2014/06/10 23:21:10 UTC

StreamException while adding nodes

Hi,

I tried to double the size of an existing cluster from 4 to 8 nodes. First
I added one node, which joined successfully after 120 min. During that time
there was no additional load on the cluster. Afterwards I started the other
3 new nodes one after another so that they would join the cluster at the
same time, and I put some write load on the cluster. About 45 min into the
process, 2 nodes died with the following exception:

Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
        at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)

Since I restarted Cassandra on the failing nodes (8 hours ago), the 3
nodes have remained in status JOINING, but no data is being exchanged any
more.

Furthermore, nodetool info throws the exception:

Exception in thread "main" java.lang.AssertionError
        at org.apache.cassandra.locator.TokenMetadata.getTokens(TokenMetadata.java:502)
        at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2132)

which corresponds to isMember returning FALSE.

 public Collection<Token> getTokens(InetAddress endpoint)
    {
        assert endpoint != null;
        assert isMember(endpoint);


My questions right now are:
- What could have caused the streaming error?
- Should nodes not be added while there is some load on the cluster? OS load
was between 2 and 6 on a dual-core machine.
- Would it have been better to add the 3 new nodes one by one, rather than
simultaneously?
- How should I proceed with the 3 half-joined nodes, as they are not
receiving the missing data?

We are using Cassandra 2.0.7 (vnodes and broadly the default config) and
RF 2, with each node having roughly 17 GB of data on it.

Thanks for any hints,
Phil

Re: StreamException while adding nodes

Posted by Philipp Potisk <ph...@omnecon.com>.
As we are still failing to add the 3 additional nodes, we would appreciate
any further thoughts.

I have removed all 3 half-joined nodes, deleted the data directories and
started only one node. Since then (more than 24 hours ago) the node has been
in status JOINING (nodetool status: UJ, nodetool gossipinfo:
STATUS:BOOT,-7774403902045887560) but has not received any data.

nodetool status shows that only 5.72 MB has arrived so far:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Owns (effective)  Host ID                               Token                 Rack
UJ  10.140.118.4    5.72 MB    ?                 dc110f47-67b0-40c9-bef7-3dff59bfe29c  -9201583989361968764  rack1
UN  10.53.186.53    29.59 GB   43.1%             80cb0036-33b9-4c37-b789-7dac340034ee  -9137279293977023905  rack1
UN  10.140.120.27   25.27 GB   37.8%             2564094b-08ea-42c4-82b0-a8246bd3ebcf  -9201237785760477995  rack1
UN  10.53.170.3     26.82 GB   38.1%             737f49e5-684f-46ef-bf8b-c82326128835  -9106630210265624873  rack1
UN  10.140.104.105  27.88 GB   39.7%             18c74472-235d-4284-9906-0ab8cc40011d  -9213643688261125087  rack1
UN  10.53.170.41    26.28 GB   41.3%             866d2276-0dac-41b3-aece-6a2711ef0234  -9031518559431277310  rack1
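
For anyone who wants to watch for stuck joiners without reading the table by
eye, here is a minimal Python sketch that pulls the state, address and load
out of nodetool status text. It assumes the layout nodetool prints on a
terminal, with one row per node, and it relies only on the two-letter state
code described in the legend above. The function names and the SAMPLE string
are made up for illustration; they are not part of Cassandra or nodetool.

```python
import re

# Two rows copied from the status output above, one line per node.
SAMPLE = """\
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Owns (effective)  Host ID                               Token                 Rack
UJ  10.140.118.4    5.72 MB    ?                 dc110f47-67b0-40c9-bef7-3dff59bfe29c  -9201583989361968764  rack1
UN  10.53.186.53    29.59 GB   43.1%             80cb0036-33b9-4c37-b789-7dac340034ee  -9137279293977023905  rack1
"""

# First letter: Up/Down. Second letter: Normal/Leaving/Joining/Moving.
STATE_CODE = re.compile(r"[UD][NLJM]")


def parse_status(text):
    """Return (state, address, load) tuples for every node row in the text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Legend and header lines do not start with a two-letter state code.
        if len(fields) >= 4 and STATE_CODE.fullmatch(fields[0]):
            state, address = fields[0], fields[1]
            load = fields[2] + " " + fields[3]  # value and unit, e.g. "5.72 MB"
            rows.append((state, address, load))
    return rows


def joining_nodes(text):
    """Return the addresses of all nodes whose state letter is J (Joining)."""
    return [address for state, address, _ in parse_status(text) if state[1] == "J"]


if __name__ == "__main__":
    print(joining_nodes(SAMPLE))
```

A node that shows up in joining_nodes for hours while its load stays flat is
the situation described in this message.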

Furthermore, it is very strange that nodetool describering does not include
the IP of the new node in its endpoints list. The command:

nodetool describering TransactionUseCaseAddNodes | grep 10.140.118.4

does not output anything.
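
One caveat when checking for an address this way: a plain substring search
for 10.140.118.4 would also match a longer address such as 10.140.118.41, so
a hit does not always mean what it seems to. A minimal Python sketch of a
whole-address match follows. The function name is made up, and the sample
line used in the usage note is invented text, not real describering output.

```python
import re


def mentions_endpoint(text, ip):
    """Return True if `ip` occurs in `text` as a complete IPv4 address.

    The lookbehind rejects a match that is the tail of a longer address,
    and the lookahead rejects one that is the head of a longer address.
    """
    pattern = r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)"
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    # Invented example line, for illustration only.
    line = "endpoints:[10.53.186.53, 10.140.118.41]"
    print(mentions_endpoint(line, "10.140.118.4"))
    print(mentions_endpoint(line, "10.140.118.41"))
```

In the case above the search found nothing at all, so the conclusion that the
new node is missing from the ring description still stands.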

It seems that no token ranges are assigned to this node. However, according
to the documentation on vnodes, rebalancing should happen automatically.
Is there still a way to force rebalancing in Cassandra 2.X using vnodes? Or
is there something else I could look into?






-- 
DI Philipp Potisk

Omnecon IT e.U.

Klabundgasse 5-7/3/17
1190 Wien

Tel.: +43 660 46 02 632
E-Mail.: philipp.potisk@omnecon.com

Firmenbuchnummer: FN 342255 t
UID: ATU65503966

Re: StreamException while adding nodes

Posted by Philipp Potisk <ph...@omnecon.com>.
Hey Rob,

thanks for pointing out the issue with simultaneous bootstraps. However, I
am not sure whether this applies in my case. As a matter of fact, I did not
start the nodes simultaneously - I waited about 10 min until they were
receiving streams from other nodes. So I guess the topology changes were
exchanged as expected. Only the joining of the 3 nodes happened
simultaneously. The StreamException, which killed the process, also happened
at a later point in time. Since then the nodes have not resumed the join
process. I am now thinking of decommissioning and starting all over again.

Phil






Re: StreamException while adding nodes

Posted by Robert Coli <rc...@eventbrite.com>.
On Tue, Jun 10, 2014 at 2:21 PM, Philipp Potisk <ph...@geroba.at>
wrote:

> First I added one node, which joined after 120min successfully. During
> that time there was no additional load on the cluster. Afterwards I started
> the other 3 new nodes after each other in order to join the cluster
> simultaneously.
>

Bootstrapping multiple nodes at once is now and has always been Not
Supported, but is such a common thing for new operators to try that there
is now a goal to prevent them from doing it [1].

Cancel those simultaneous bootstraps and do them one at a time, and they'll
probably work.

[1] https://issues.apache.org/jira/browse/CASSANDRA-7069

=Rob
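
The "one at a time" advice above can be sketched as a small Python driver
that starts a node, waits until it is reported as UN (Up/Normal), and only
then moves on. This is an illustration under stated assumptions, not an
official procedure: start_node and get_states are hypothetical callables you
would supply yourself (for example, one that starts Cassandra on a host and
one that turns nodetool status output into an address-to-state mapping), and
the polling interval and limit are arbitrary.

```python
import time


def bootstrap_one_at_a_time(new_nodes, start_node, get_states,
                            poll_seconds=60, max_polls=1000, sleep=time.sleep):
    """Start each node in turn, waiting for it to finish joining first.

    start_node(node)  -- hypothetical: starts Cassandra on that node
    get_states()      -- hypothetical: returns {address: "UN" | "UJ" | ...}
    """
    for node in new_nodes:
        start_node(node)
        for _ in range(max_polls):
            if get_states().get(node) == "UN":
                break  # node has joined; safe to continue with the next one
            sleep(poll_seconds)
        else:
            # Never reached UN: stop here rather than start another
            # bootstrap on top of a stuck one.
            raise TimeoutError(
                f"{node} did not reach UN; not starting further nodes")


if __name__ == "__main__":
    # Tiny simulation: each "sleep" lets every joining node finish.
    states = {}

    def fake_start(node):
        states[node] = "UJ"

    def fake_sleep(_seconds):
        for address in states:
            if states[address] == "UJ":
                states[address] = "UN"

    bootstrap_one_at_a_time(["n1", "n2", "n3"], fake_start,
                            lambda: states, sleep=fake_sleep)
    print(states)
```

The timeout is there because a node can sit in UJ indefinitely, as described
earlier in this thread; in that case the loop stops instead of piling a second
bootstrap on top of the first.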