Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2013/12/12 12:40:07 UTC

[jira] [Commented] (CASSANDRA-6476) Assertion error in MessagingService.addCallback

    [ https://issues.apache.org/jira/browse/CASSANDRA-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846256#comment-13846256 ] 

Sylvain Lebresne commented on CASSANDRA-6476:
---------------------------------------------

MessagingService ain't the native transport (fyi, the native transport code doesn't leak outside the org.apache.cassandra.transport package), it's the intra-cluster messaging. In fact, the stack trace shows that the write that triggers it doesn't even come from the native protocol but from Thrift (which means you either use Thrift for some things or something is whack).

But truth is, given the stack trace, where the write comes from doesn't matter. The assertion that fails is the line
{noformat}
assert previous == null;
{noformat}
in MessagingService.addCallback. And that's where things stop making sense to me. This means that we tried to add a new message to the callback map but there was already one with the same messageId. Except that messageId is very straightforwardly generated by an {{incrementAndGet}} on a static AtomicInteger. And as far as I can tell, no other code inserts into the callback map without grabbing a new messageId this way (except setCallbackForTests, but that is only used in a unit test).
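
For context, the logic in question looks roughly like this (a paraphrased, simplified sketch of the id generation and callback registration, not the exact 2.0.2 source; the real map is an expiring map and the value is a CallbackInfo):
{noformat}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model: 'callbacks' stands in for the expiring callback map,
// 'Object' stands in for CallbackInfo.
class MessagingServiceSketch
{
    // Message ids come from a single static counter.
    private static final AtomicInteger idGen = new AtomicInteger(0);
    private final ConcurrentHashMap<Integer, Object> callbacks = new ConcurrentHashMap<>();

    private static int nextId()
    {
        return idGen.incrementAndGet();
    }

    public int addCallback(Object callbackInfo)
    {
        int messageId = nextId();
        // Register the callback under the freshly generated id.
        Object previous = callbacks.put(messageId, callbackInfo);
        // A non-null 'previous' means another in-flight request already holds this id,
        // which should be impossible unless the counter has wrapped all the way around.
        assert previous == null;
        return messageId;
    }
}
{noformat}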

Therefore, it seems the only way such a messageId conflict could happen is that we've gone full cycle on the AtomicInteger and hit the same id again. But entries in callbacks expire after the rpc timeout, so that implies > 4 billion requests in about 10 seconds. Sounds pretty unlikely to me.
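
Just to put a number on it, a quick back-of-the-envelope sketch (assuming entries really do live for the full ~10 second timeout):
{noformat}
// Rough arithmetic for how fast the id counter would have to wrap
// for the same messageId to still be present in the callback map.
public class WrapRate
{
    public static void main(String[] args)
    {
        long idSpace = 1L << 32;       // a full cycle of the 32-bit counter
        double timeoutSeconds = 10.0;  // approximate rpc timeout, per the discussion above
        // ~4.3 billion requests in ~10s, i.e. ~430 million requests per second on one node.
        System.out.printf("required rate: %.0f requests/second%n", idSpace / timeoutSeconds);
    }
}
{noformat}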

But I might be missing something obvious: [~jbellis], I believe you might be more familiar with MessagingService; any idea?


> Assertion error in MessagingService.addCallback
> -----------------------------------------------
>
>                 Key: CASSANDRA-6476
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6476
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.0.2 DCE
>            Reporter: Theo Hultberg
>            Assignee: Sylvain Lebresne
>
> Two of the three Cassandra nodes in one of our clusters just started behaving very strangely about an hour ago. Within a minute of each other they started logging AssertionErrors (see stack traces here: https://gist.github.com/iconara/7917438) over and over again. The client lost connection with the nodes at roughly the same time. The nodes were still up, and even though no clients were connected to them they continued logging the same errors over and over.
> The errors are in the native transport (specifically MessagingService.addCallback), which makes me suspect that it has something to do with a test that we started running this afternoon. I've just implemented support for frame compression in my CQL driver cql-rb. About two hours before this happened I deployed a version of the application which enabled Snappy compression on all frames larger than 64 bytes. It's not impossible that there is a bug somewhere in the driver or compression library that caused this, but at the same time it feels like it shouldn't be possible to make C* a zombie with a bad frame.
> Restarting seems to have got them running again, but I suspect they will go down again sooner or later.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)