Posted to user@cassandra.apache.org by Steven A Robenalt <sr...@stanford.edu> on 2013/11/19 02:49:07 UTC

Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

Hi all,

I am attempting to bring up our new app on a 3-node cluster and am having
problems with frequent read timeouts and slow inter-node replication.
Initially, these errors were mostly occurring in our app server, affecting
0.02%-1.0% of our queries in an otherwise unloaded cluster. No exceptions
were logged on the servers in this case, and reads in a single-node
environment with the same code and client driver virtually never see
exceptions like this, so I suspect a problem with communication between
the nodes in the cluster.

The 3 nodes are deployed in a single AWS VPC, and are all in a common
subnet. The Cassandra version is 2.0.2 following an upgrade this past
weekend due to NPEs in a secondary index that were affecting certain
queries under 2.0.1. The servers are m1.large instances running AWS Linux
and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes.
All database contents are CQL tables with replication factor of 3, and the
application is Java-based, using the latest Datastax 2.0.0-rc1 Java Driver.

In testing with the application, I noticed this afternoon that the contents
of the 3 nodes differed in their respective copies of the same table for
newly written data, for time periods exceeding several minutes, as reported
by cqlsh on each node. Connecting to different hosts with cqlsh from the
same server also timed out on multiple connection attempts and on some
queries, though every attempt eventually succeeded and the data on all
nodes was eventually fully replicated.

The AWS servers have a security group with only ports 22, 7000, 9042, and
9160 open.

At this time, it seems that either I am still missing something in my
cluster configuration, or maybe there are other ports that are needed for
inter-node communication.
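
One way to narrow down the port question is to test TCP reachability from
each node to every other node. The sketch below is an illustration, not part
of the original report. The port table lists the stock Cassandra defaults
(7000 storage, 7001 SSL storage, 7199 JMX, 9042 native CQL, 9160 Thrift);
a cluster with a customized cassandra.yaml may differ. The function names
are made up for this example.

```python
import socket

# Stock Cassandra default ports; a customized cassandra.yaml may use others.
CASSANDRA_PORTS = {
    7000: "inter-node (storage_port)",
    7001: "inter-node over SSL (ssl_storage_port)",
    7199: "JMX (used by nodetool)",
    9042: "CQL native transport",
    9160: "Thrift RPC",
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_nodes(hosts, ports=CASSANDRA_PORTS, timeout=2.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {
        (host, port): port_open(host, port, timeout)
        for host in hosts
        for port in sorted(ports)
    }
```

Running check_nodes(["10.0.0.11", "10.0.0.12", "10.0.0.13"]) on each node
(the addresses are placeholders) shows which ports the security group
actually lets through. This only covers TCP, so UDP services such as NTP
(port 123) need a separate check.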

Any advice/suggestions would be appreciated.



-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobenal@stanford.edu
http://highwire.stanford.edu

Re: Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

Posted by Steven A Robenalt <sr...@stanford.edu>.

Looks like the read timeouts were a result of a bug that will be fixed in
2.0.3.

I found this question on the Datastax Java Driver mailing list:
https://groups.google.com/a/lists.datastax.com/forum/#!topic/java-driver-user/ao1ohSLpjRM

which led me to:
https://issues.apache.org/jira/browse/CASSANDRA-6299

I built and deployed a 2.0.3 snapshot this morning, which includes this
fix, and my cluster is now behaving normally (no read timeouts so far).



On Tue, Nov 19, 2013 at 4:55 PM, Steven A Robenalt <sr...@stanford.edu> wrote:

> It seems that with NTP properly configured, the replication is now working
> as expected, but there are still a lot of read timeouts. The
> troubleshooting continues...

Re: Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

Posted by Steven A Robenalt <sr...@stanford.edu>.

It seems that with NTP properly configured, the replication is now working
as expected, but there are still a lot of read timeouts. The
troubleshooting continues...
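
The thread does not say why clock sync mattered here. One well-known reason
clocks matter in Cassandra is that conflicting versions of a cell are
reconciled by write timestamp (last write wins). The toy model below
illustrates that rule only. It is not Cassandra's code, the skew values are
made up, and it is an assumption that this mechanism explains the symptoms
reported in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_us: int  # write timestamp in microseconds, taken from a node's clock


def reconcile(a, b):
    """Simplified last-write-wins: keep the version with the higher timestamp."""
    return a if a.timestamp_us >= b.timestamp_us else b


def stamp(value, true_time_us, clock_skew_us):
    """Timestamp a write using a clock that is off by clock_skew_us."""
    return Write(value, true_time_us + clock_skew_us)


# Hypothetical scenario: node A's clock runs 5 s fast, node B's is accurate.
first = stamp("old", true_time_us=1_000_000, clock_skew_us=5_000_000)  # written first, via A
second = stamp("new", true_time_us=2_000_000, clock_skew_us=0)         # written 1 s later, via B

winner = reconcile(first, second)
print(winner.value)  # prints "old": the earlier write carries the larger timestamp
```

In this model the newer value is shadowed until real time catches up with
the skewed timestamp, which can look like data that has not replicated yet.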


On Tue, Nov 19, 2013 at 8:53 AM, Steven A Robenalt <sr...@stanford.edu> wrote:

> Thanks Michael, I will try that out.

Re: Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

Posted by Steven A Robenalt <sr...@stanford.edu>.

Thanks Michael, I will try that out.


On Tue, Nov 19, 2013 at 5:28 AM, Laing, Michael <mi...@nytimes.com> wrote:

> We had a similar problem when our nodes could not sync using ntp due to
> VPC ACL settings. -ml

Re: Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

Posted by "Laing, Michael" <mi...@nytimes.com>.

We had a similar problem when our nodes could not sync using ntp due to VPC
ACL settings. -ml
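
NTP runs over UDP port 123, so a security group or network ACL that only
allows specific TCP ports will block it. A rough way to confirm whether a
node can reach a time server is to send a single SNTP request and see
whether a reply arrives. The sketch below is a minimal illustration under
that assumption. The function names are made up, and it is not a
replacement for ntpd or for tools such as ntpq.

```python
import socket
import struct
import time

NTP_PORT = 123                 # NTP uses UDP port 123
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)


def build_request():
    """Build a 48-byte SNTP client request: LI=0, version=3, mode=3 (client)."""
    return b"\x1b" + 47 * b"\x00"


def transmit_time(reply):
    """Extract the server's transmit timestamp (as Unix seconds) from a reply."""
    if len(reply) < 48:
        raise ValueError("short NTP reply")
    # The transmit timestamp occupies bytes 40-47: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(server, timeout=2.0):
    """Rough offset (server minus local) in seconds.

    Raises a timeout error if no reply arrives, for example when
    UDP 123 is blocked by an ACL.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent = time.time()
        sock.sendto(build_request(), (server, NTP_PORT))
        reply, _ = sock.recvfrom(512)
        received = time.time()
    # Compare the server's time with the midpoint of the local round trip.
    return transmit_time(reply) - (sent + received) / 2
```

Calling clock_offset("pool.ntp.org") on each node (the server name is only
an example) either returns an approximate offset in seconds or times out.
A timeout on every node would point to blocked UDP traffic.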

