Posted to user@cassandra.apache.org by Skye Book <sk...@gmail.com> on 2013/11/18 08:36:00 UTC

Re: Nodes not added to existing cluster

Hi there,

I’m bringing this thread back, as it’s something that I thought was solved but is apparently not fixed on my end.

To recap, I’m having trouble getting a node to join a cluster.  Configuration seems all right using the EC2MultiRegionSnitch but new nodes are unable to handshake with seeds.

- Security Group has 22 && 1024-65535 open
- Nodes are configured with password authentication using CassandraAuthorizer
- internode_authenticator is commented out in configuration
- rpc_address is set to the instance’s private address
- listen_address is set to the instance’s private address
- broadcast_address is set to the instance's public address
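
The address layout in the list above can be sanity-checked with a few lines of Python. This is only a sketch under stated assumptions: the function name `check_addresses` and the sample IPs are made up for illustration, and the settings are assumed to have already been read into a plain dict rather than parsed from cassandra.yaml.

```python
import ipaddress


def check_addresses(settings):
    """Return a list of warnings for the address layout described above.

    `settings` is a hypothetical dict holding the three values from
    cassandra.yaml; nothing here talks to Cassandra itself.
    """
    warnings = []
    for key in ("listen_address", "rpc_address"):
        # These are expected to be the instance's private address.
        if not ipaddress.ip_address(settings[key]).is_private:
            warnings.append(f"{key} is not a private address: {settings[key]}")
    # broadcast_address is expected to be the instance's public address.
    if ipaddress.ip_address(settings["broadcast_address"]).is_private:
        warnings.append(
            f"broadcast_address is a private address: {settings['broadcast_address']}"
        )
    return warnings


# Illustrative values only, not taken from the real cluster.
print(check_addresses({
    "listen_address": "10.0.0.5",
    "rpc_address": "10.0.0.5",
    "broadcast_address": "54.12.34.56",
}))
```

An empty list means the three settings match the layout described; it says nothing about whether the nodes can actually reach each other.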

As was suggested earlier, I’ve enabled TRACE logging for OutboundTcpConnection and get the following dumped into system.log when the new node is started up without itself in the seed list (if its own IP is in the list it just creates a new single node cluster).  I’ve gisted the results here: https://gist.github.com/skyebook/be5ee75a000a1e6d65d0
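
Since a node that lists itself as a seed starts its own single-node cluster, it may be worth checking the seed list before each start. A minimal sketch, assuming the seeds are available as the usual comma-separated string; the function and parameter names are illustrative, not part of Cassandra:

```python
def node_is_own_seed(seeds, own_addresses):
    """Return True if any of this node's addresses appears in the seed list.

    `seeds` is the comma-separated string from the seed provider config;
    `own_addresses` is an iterable of this node's private and public IPs.
    Both parameter names are illustrative.
    """
    seed_set = {s.strip() for s in seeds.split(",") if s.strip()}
    return any(addr in seed_set for addr in own_addresses)
```

Passing both the private and the public address of the instance covers the case where the seed list uses public IPs while listen_address is private.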

It looks like the handshake process completely and utterly fails as it seems unable to get any information from the other nodes as evidenced by:
OutboundTcpConnection.java (line 386) Handshaking version with /NODE_1_PUBLIC_IP
OutboundTcpConnection.java (line 386) Handshaking version with /NODE_2_PUBLIC_IP
OutboundTcpConnection.java (line 333) Target max version is -2147483648; no version information yet, will retry

Thanks in advance for any light you all might be able to shed on what’s going on.

On Sep 26, 2013, at 9:03 PM, Aaron Morton <aa...@thelastpickle.com> wrote:

>>  INFO 05:03:49,015 Cannot handshake version with /aa.bb.cc.dd
>>  INFO 05:03:49,017 Handshaking version with /aa.bb.cc.dd
> If you can turn up logging to TRACE for org.apache.cassandra.net.OutboundTcpConnection it will include the full error. 
> 
>> The two addresses that it is unable to handshake with are the other two addresses of nodes in the cluster I'm unable to join.
> Are you mixing versions ? 
> 
> 
> Cheers
> 
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
> 
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
> 
> On 26/09/2013, at 5:13 PM, Skye Book <sk...@gmail.com> wrote:
> 
>> Hi Aaron, thanks for the clarification.
>> 
>> As might be expected, having the broadcast_address fixed hasn't fixed anything.  What I did find after writing my last email is that output.log is littered with these:
>> 
>>  INFO 05:03:49,015 Cannot handshake version with /aa.bb.cc.dd
>>  INFO 05:03:49,017 Handshaking version with /aa.bb.cc.dd
>>  INFO 05:03:49,803 Cannot handshake version with /ww.xx.yy.zz
>>  INFO 05:03:49,805 Handshaking version with /ww.xx.yy.zz
>> 
>> The two addresses that it is unable to handshake with are the other two addresses of nodes in the cluster I'm unable to join.  I started thinking that maybe EC2 was having an unadvertised problem communicating between AZs, but bringing up nodes in both of the other availability zones resulted in the same wrong behavior.
>> 
>> I've gist'd my cassandra.yaml; it's pretty standard and hasn't caused an issue in the past for me.  https://gist.github.com/skyebook/ec9364cdcec02e803ffc
>> 
>> Skye Book
>> http://skyebook.net -- @sbook
>> 
>> On Sep 26, 2013, at 12:34 AM, Aaron Morton <aa...@thelastpickle.com> wrote:
>> 
>>>>  I am curious, though, how any of this worked in the first place spread across three AZ's without that being set?
>>> broadcast_address is only needed when you are going cross region (IIRC it's the EC2MultiRegionSnitch that sets it). 
>>> 
>>> As Rob said, make sure the seed list includes one of the other nodes and that the cluster_name is set. 
>>> 
>>> Cheers
>>> 
>>> -----------------
>>> Aaron Morton
>>> New Zealand
>>> @aaronmorton
>>> 
>>> Co-Founder & Principal Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>> 
>>> On 26/09/2013, at 8:12 AM, Skye Book <sk...@gmail.com> wrote:
>>> 
>>>> Thank you, both Michael and Robert, for your suggestions.  I actually saw 5760, but we were running on 2.0.0, in which this seems to have been fixed.
>>>> 
>>>> That said, I noticed that my Chef scripts were failing to set the broadcast_address correctly, which I'm guessing is the cause of the problem.  I'm fixing that and trying a redeploy.  I am curious, though, how any of this worked in the first place spread across three AZs without that being set?
>>>> 
>>>> -Skye
>>>> 
>>>> On Sep 25, 2013, at 3:56 PM, Robert Coli <rc...@eventbrite.com> wrote:
>>>> 
>>>>> On Wed, Sep 25, 2013 at 12:41 PM, Skye Book <sk...@gmail.com> wrote:
>>>>> I have a three node cluster using the EC2 Multi-Region Snitch currently operating only in US-EAST.  On having a node go down this morning, I started a new node with an identical configuration, except for the seed list, the listen address and the rpc address.  The new node comes up and creates its own cluster rather than joining the pre-existing ring.  I've tried creating a node both before and after using `nodetool remove` for the bad node, each time with the same result.
>>>>> 
>>>>> What version of Cassandra?
>>>>> 
>>>>> This particular confusing behavior is fixed upstream, in a version you should not deploy to production yet. Take some solace, however, that you may be the last Cassandra administrator to die for a broken code path!
>>>>> 
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-5768
>>>>> 
>>>>> Does anyone have any suggestions for where to look that might put me on the right track?
>>>>> 
>>>>> It must be that your seed list is wrong in some way, or your node state is wrong. If you're trying to bootstrap a node, note that you can't bootstrap a node when it is in its own seed list.
>>>>> 
>>>>> If you have installed Cassandra via debian package, there is a possibility that your node has started before you explicitly started it. If so, it might have invalid node state.
>>>>> 
>>>>> Have you tried wiping the data directory and trying again?
>>>>> 
>>>>> What is your seed list? Are you sure the new node can reach the seeds on the network layer?
>>>>> 
>>>>> =Rob
>>>> 
>>> 
>> 
> 


Re: Nodes not added to existing cluster

Posted by Aaron Morton <aa...@thelastpickle.com>.
> - broadcast_address is set to the instance's public address
You only need this if you have a multi region setup. 

>  I’ve gisted the results here: https://gist.github.com/skyebook/be5ee75a000a1e6d65d0

This error

TRACE [HANDSHAKE-/NODE_1_PUBLIC_IP] 2013-11-18 06:57:13,984 OutboundTcpConnection.java (line 393) Cannot handshake version with /NODE_1_PUBLIC_IP
java.nio.channels.AsynchronousCloseException
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:402)
	at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
	at java.io.InputStream.read(InputStream.java:101)
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:387)

is preventing the node from reading the version, which results in this line being printed (-2147483648 is the "no version" flag):

> OutboundTcpConnection.java (line 333) Target max version is -2147483648; no version information yet, will retry
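
For anyone scanning a large system.log for this condition, the sentinel can be matched directly. The snippet below is a sketch, not anything shipped with Cassandra: the names `NO_VERSION`, `VERSION_LINE` and `has_no_version` are made up here, and the regular expression assumes the message wording shown in the line above.

```python
import re

# -2147483648 is the smallest 32-bit signed integer (Integer.MIN_VALUE in Java).
NO_VERSION = -(2 ** 31)

# Assumes the message format seen in the log line quoted above.
VERSION_LINE = re.compile(r"Target max version is (-?\d+)")


def has_no_version(log_line):
    """Return True if the line reports the 'no version' sentinel."""
    match = VERSION_LINE.search(log_line)
    return match is not None and int(match.group(1)) == NO_VERSION
```

Lines that report a real version number, or that do not contain the message at all, return False.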

 
I'm not really sure why that exception is being thrown; the Javadoc does not make it clear: http://docs.oracle.com/javase/7/docs/api/java/nio/channels/AsynchronousCloseException.html

Check the networking. 
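
One quick way to do that from the new node is a plain TCP connect to each seed. This is a minimal sketch, assuming the default storage_port of 7000 (adjust it if cassandra.yaml says otherwise); `can_connect` is a made-up helper name, and a successful connect only shows the port is reachable, not that the version handshake will complete.

```python
import socket


def can_connect(host, port=7000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    7000 is Cassandra's default storage_port; adjust it if cassandra.yaml
    sets a different value (or uses the SSL storage port).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Running it against each seed's public address would match what the log shows the handshake targeting (/NODE_1_PUBLIC_IP and /NODE_2_PUBLIC_IP).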

Hope that helps. 

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
