Posted to user@cassandra.apache.org by E S <tr...@yahoo.com> on 2012/05/14 15:00:43 UTC

Odd Node Behavior

Hello,

I am having some very strange issues with a cassandra setup.  I recognize that this is not the ideal cluster setup, but I'd still like to try and understand what is going wrong.

The cluster has 3 machines (A, B, C) running Cassandra 1.0.9 with JNA.  A & B are in datacenter1 while C is in datacenter2.  Cassandra knows about the different datacenters because of the rack-inferred snitch.  However, we are currently using a simple placement strategy on the keyspace.  All reads and writes are done with quorum.  Hinted handoffs are enabled.  Most of the cassandra settings are at their defaults, with the exception of the thrift message size, which we have upped to 256 MB (while very rare, we can sometimes have a few larger rows, so we wanted a big buffer).  There is a firewall between the two datacenters.  We have enabled TCP traffic for the thrift and storage ports (but not JMX, and no UDP).
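
For concreteness, here is a rough sketch of the quorum arithmetic in Python (the replication factor is not stated here; the sketch assumes RF=3):

    # Illustrative only: Cassandra's QUORUM is floor(RF / 2) + 1 replicas.
    # RF=3 is an assumption, not something stated in this message.
    def quorum(replication_factor):
        return replication_factor // 2 + 1

    rf = 3
    needed = quorum(rf)      # 2 replicas must acknowledge
    live = 2                 # e.g. A is down, B and C remain
    print("quorum needs %d of %d replicas" % (needed, rf))
    print("requests can succeed" if live >= needed else "requests will time out")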

Another odd thing is that there are actually 2 cassandra clusters hosted on these machines (although with the same setup).  Each machine has 2 cassandra processes, but everything is running on different ports and different cluster names.
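
For reference, a sketch of the per-instance settings that have to stay distinct in that kind of setup (the ports and paths below are placeholders, not the actual values used here):

    # Placeholder values; illustrates which settings must not collide when two
    # Cassandra processes share a host (cassandra.yaml for most of these,
    # cassandra-env.sh for the JMX port).
    instance_one = {
        "cluster_name": "ClusterOne",
        "storage_port": 7000,     # inter-node traffic
        "rpc_port": 9160,         # thrift clients
        "jmx_port": 7199,
        "data_file_directories": ["/var/lib/cassandra-one/data"],
        "commitlog_directory": "/var/lib/cassandra-one/commitlog",
    }
    instance_two = {
        "cluster_name": "ClusterTwo",
        "storage_port": 7100,
        "rpc_port": 9161,
        "jmx_port": 7299,
        "data_file_directories": ["/var/lib/cassandra-two/data"],
        "commitlog_directory": "/var/lib/cassandra-two/commitlog",
    }

    collisions = [k for k in instance_one
                  if k != "cluster_name" and instance_one[k] == instance_two[k]]
    assert not collisions, "colliding settings: %s" % collisions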

On one of the two clusters we were doing some failover testing.  We would take nodes down quickly in succession and make sure the system remained up.

Most of the time, we got a few timeouts on the failover (unexpected, but not the end of the world) and then quickly recovered; however, twice we were able to put the cluster in an unusable state.  We found that sometimes node C, while seemingly up (no load, and marked as UP in the ring by other nodes), was unresponsive to B (when A was down) when B was coordinating a quorum write.  We see B making a request in the logs (on debug) and 10 seconds later timing out.  We see nothing happening in C's log (also debug).  The box is just idling.  In retrospect, I should have put it in trace (will do this next time).  We had it come back after 30 minutes once.  Another time, it came back earlier after cycling it.
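
(The 10 seconds lines up with the 1.0.x default rpc_timeout_in_ms of 10000.)  Here is a rough sketch, not Cassandra's actual code, of what the coordinator is doing during that window:

    import queue
    import threading
    import time

    RPC_TIMEOUT_S = 10.0   # mirrors the 1.0.x default rpc_timeout_in_ms: 10000

    def quorum_write(replicas, needed_acks, send_mutation):
        # Send the mutation to every live replica, then block until enough
        # acks arrive or the timeout expires -- roughly what a coordinator does.
        acks = queue.Queue()
        for replica in replicas:
            threading.Thread(target=lambda r=replica: acks.put(send_mutation(r)),
                             daemon=True).start()

        deadline = time.monotonic() + RPC_TIMEOUT_S
        received = 0
        while received < needed_acks:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("%d/%d acks after %.0fs"
                                   % (received, needed_acks, RPC_TIMEOUT_S))
            try:
                acks.get(timeout=remaining)
                received += 1
            except queue.Empty:
                pass
        return "acknowledged by quorum"

If C never receives the mutation, or its ack never makes it back to B, a loop like this burns the full 10 seconds on every request, which matches the behavior above.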

I also noticed a few other crazy log messages on C in that time period.  There were two instances of "invalid protocol header", which in the code seems to happen only when PROTOCOL_MAGIC doesn't match (MessagingService.java), and that seems like an impossible state.

I'm currently at a loss trying to explain what is going on.  Has anyone seen anything like this?  I'd appreciate any additional debugging ideas!  Thanks for any help.

Regards,
Eddie  


Re: Odd Node Behavior

Posted by aaron morton <aa...@thelastpickle.com>.
> Most of the time, we got a few timeouts on the failover (unexpected, but not the end of the world) and then quickly recovered; 
For read or write requests? I'm guessing with 3 nodes you are using RF 3. In cassandra 1.x the read repair chance is only 10%, so 90% of the time only CL nodes are involved in a read request. If one of the nodes involved dies during the request, the coordinator will time out waiting. 
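
A sketch of that replica selection (illustrative, not the StorageProxy code):

    import random

    READ_REPAIR_CHANCE = 0.1   # the 1.0.x default mentioned above

    def replicas_for_read(all_replicas, cl_nodes):
        # ~10% of reads touch every replica (and get repaired if they
        # disagree); the other ~90% only touch the CL closest replicas, so
        # one of those dying mid-request leaves the coordinator waiting.
        if random.random() < READ_REPAIR_CHANCE:
            return list(all_replicas)
        return list(all_replicas)[:cl_nodes]

    print(replicas_for_read(["A", "B", "C"], cl_nodes=2))   # QUORUM with RF 3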
 
> We see B making a request in the logs (on debug) and 10 seconds later timing out.  We see nothing happening in C's log (also debug).  
What were the log messages from the nodes? In particular the ones from the StorageProxy on node B and the RowMutationVerbHandler on node C.

> In retrospect, I should have put it in trace (will do this next time)
TRACE logs a lot of stuff. I'd hold off on that.  

> I also noticed a few other crazy log messages on C in that time period. 
What were the log messages?

>  There were two instances of "invalid protocol header", which in code seems to only happen when PROTOCOL_MAGIC doesn't match (MessagingService.java), which seems like an impossible state.
Often means something other than Cassandra connected on the port. 
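
For illustration, the check has roughly this shape (the magic constant below is a placeholder, not Cassandra's actual value):

    import struct

    PROTOCOL_MAGIC = 0xDEADBEEF   # placeholder; the real constant lives in
                                  # MessagingService.java

    def validate_header(first_four_bytes):
        # Read a 32-bit big-endian int off the new connection and compare it
        # to the expected magic; anything else is rejected.
        (magic,) = struct.unpack(">I", first_four_bytes)
        if magic != PROTOCOL_MAGIC:
            raise IOError("invalid protocol header: 0x%08x" % magic)

    validate_header(struct.pack(">I", PROTOCOL_MAGIC))   # another node passes

A port scanner, a load-balancer health check, or a thrift client pointed at the storage port would all trip it.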

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
