Posted to user@cassandra.apache.org by Arya Goudarzi <go...@gmail.com> on 2013/04/17 00:42:34 UTC

Repair Freeze / Gossip Invisibility / EC2 Public IP configuration

TL;DR: In an EC2 multi-region setup, repair and gossip work with 1.1.10, but
with 1.2.4 gossip does not see the nodes after restarting all nodes at once,
and repair gets stuck.

This is the working configuration:

Cassandra 1.1.10 cluster with 12 nodes in us-east-1 and 12 nodes in
us-west-2, using Ec2MultiRegionSnitch with SSL enabled for DC_ONLY, and
NetworkTopologyStrategy with strategy_options: us-east-1:3;us-west-2:3;

The C* instances are in a security group called 'cluster1'. In each region
'cluster1' is configured as follows.

Allow TCP:
7199 from cluster1 (JMX)
1024 - 65535 from cluster1 (JMX random ports - this supersedes the specific
ports below, but I list them for clarity)
7100 from cluster1 (configured normal storage port)
7103 from cluster1 (configured SSL storage port)
9160 from cluster1 (configured Thrift RPC port)
9160 from <client_group>

For each node's public IP we also have this rule to enable cross-region
communication:
7103 from public_ip (open SSL storage)

The above is a functioning and happy setup. You run repair, and it finishes
successfully.
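
For anyone trying to reproduce this, the relevant cassandra.yaml settings
look roughly like the sketch below. This is only a sketch: the addresses and
keystore paths are placeholders, but the non-default ports and the dc-only
internode encryption are the point.

    # cassandra.yaml (sketch; placeholders in angle brackets)
    endpoint_snitch: Ec2MultiRegionSnitch
    storage_port: 7100            # plain internode port (default would be 7000)
    ssl_storage_port: 7103        # encrypted internode port (default would be 7001)
    rpc_port: 9160                # Thrift
    listen_address: <private_ip>  # the snitch gossips the public IP as broadcast_address
    server_encryption_options:
        internode_encryption: dc  # only encrypt traffic that crosses datacenters
        keystore: conf/.keystore
        keystore_password: <keystore_password>
        truststore: conf/.truststore
        truststore_password: <truststore_password>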

Broken Setup:

Upgrade to 1.2.4 without changing any of the above security group settings:

Run repair. The repair will get stuck and hang indefinitely.

Now, for each public_ip, add a rule like this to the 'cluster1' security
group:

Allow TCP: 7100 from public_ip

Run repair. Things will work now. Also, after restarting all nodes at the
same time, gossip will see everyone again.
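
Scripted against the AWS CLI, the workaround amounts to something like the
sketch below, run once per region (public_ips.txt is a placeholder for the
list of the nodes' public IPs):

    # open the plain storage port to every node's public IP in 'cluster1'
    # (repeat with --region us-west-2 for the other region)
    while read ip; do
      aws ec2 authorize-security-group-ingress \
        --region us-east-1 \
        --group-name cluster1 \
        --protocol tcp --port 7100 \
        --cidr "${ip}/32"
    done < public_ips.txt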

I was told on https://issues.apache.org/jira/browse/CASSANDRA-5432 that
nothing was changed in terms of networking. If nothing changed in terms of
ports and networking in 1.2, why is the above happening? I can reproduce it
consistently.

Please advise.

-Arya

Re: Repair Freeze / Gossip Invisibility / EC2 Public IP configuration

Posted by Ondřej Černoš <ce...@gmail.com>.
Hi,

I have a similar issue with a stuck repair. Similar multi-region setup, only
between us-east and a private cloud at Rackspace. The log mentions Merkle
tree exchanges, and I see a lot of dropped communication.
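
(The dropped counters and the stuck streams should be visible with plain
nodetool on each node; the host and JMX port below are whatever your cluster
uses:)

    nodetool -h <node_host> -p <jmx_port> tpstats    # per-verb dropped message counters
    nodetool -h <node_host> -p <jmx_port> netstats   # streaming state of the stuck repair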

I will comment on your ticket in Jira.

regards,

ondrej cernos


Re: Repair Freeze / Gossip Invisibility / EC2 Public IP configuration

Posted by Arya Goudarzi <go...@gmail.com>.
We don't use the default ports. Whoops, now I've advertised mine! I did try
disabling internode compression for all traffic in cassandra.yaml, but it
still did not work. I have to open the unencrypted storage port to the
public IPs.
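
(Concretely, the compression setting in question is the internode_compression
knob in cassandra.yaml; roughly:)

    # cassandra.yaml (1.2.x); the default is 'all'
    internode_compression: none   # valid values: all, dc, none - made no difference here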


Re: Repair Freeze / Gossip Invisibility / EC2 Public IP configuration

Posted by Edward Capriolo <ed...@gmail.com>.
Cassandra does internode compression. I have not checked, but this might be
getting turned on by default by accident. The storage port is typically
7000, so I am not sure why you are allowing 7100. In any case, try allowing
7000, or try it with internode compression turned off.
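
(To spell out the defaults I have in mind, a stock 1.2 cassandra.yaml ships
with roughly:)

    # stock cassandra.yaml defaults (1.2.x)
    storage_port: 7000
    ssl_storage_port: 7001
    internode_compression: all    # set to 'none' to rule compression out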

