You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Jiri Horky <ho...@avast.com> on 2014/08/05 14:48:27 UTC

Fail to reconnect to other nodes after intermittent network failure

Hi,

we experienced a strange problem after intermittent network failure when
the affected node did not reconnect to the rest of the cluster but did
allow to autenticate users (which was not possible during the actual
network outage, see below). The cluster consists of 1 node in each of 3
datacenters, it uses C* 1.2.16 with SSL enabled both to clients and
between C* nodes. The authentication is enabled as well.

The problem started around 2014-08-01 when Cassandra first noticed a
network problem:

 INFO [GossipTasks:1] 2014-08-01 07:47:52,618 Gossiper.java (line 823)
InetAddress /77.234.44.20 is now DOWN
 INFO [GossipTasks:1] 2014-08-01 07:47:55,619 Gossiper.java (line 823)
InetAddress mia10.ff.avast.com/77.234.42.20 is now DOWN

The network came up for a while:

INFO [GossipStage:1] 2014-08-01 07:51:29,380 Gossiper.java (line 809)
InetAddress /77.234.42.20 is now UP
 INFO [HintedHandoff:1] 2014-08-01 07:51:29,381
HintedHandOffManager.java (line 296) Started hinted handoff for host:
9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /77.234.42.20
 INFO [GossipStage:1] 2014-08-01 07:51:29,381 Gossiper.java (line 809)
InetAddress /77.234.44.20 is now UP
 INFO [HintedHandoff:2] 2014-08-01 07:51:29,385
HintedHandOffManager.java (line 296) Started hinted handoff for host:
97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /77.234.44.20

But it failed to send hints:


 INFO [HintedHandoff:1] 2014-08-01 07:51:39,389
HintedHandOffManager.java (line 427) Timed out replaying hints to
/77.234.42.20; aborting (0 delivered)
 INFO [HintedHandoff:2] 2014-08-01 07:51:39,390
HintedHandOffManager.java (line 427) Timed out replaying hints to
/77.234.44.20; aborting (0 delivered)

Also, the log started to be "flooded" with failed autentication tries.
My understanding is that authentication data are read with QUORUM which
failed as the other two nodes were down:


ERROR [Native-Transport-Requests:446116] 2014-08-01 07:51:39,985
QueryMessage.java (line 97) Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException:
java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
out - received only 0 responses.
        at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
        at
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
        at
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
        at
org.apache.cassandra.service.ClientState.authorize(ClientState.java:292)
        at
org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:172)
        at
org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:165)
        at
org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:149)
        at
org.apache.cassandra.cql3.statements.SelectStatement.checkAccess(SelectStatement.java:116)
        at
org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:102)
        at
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:113)
        at
org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:87)
        at
org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287)
        at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at
org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
        at
org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
out - received only 0 responses.
        at org.apache.cassandra.auth.Auth.selectUser(Auth.java:256)
        at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:84)
        at
org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50)
        at
org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:68)
        at
org.apache.cassandra.service.ClientState$1.load(ClientState.java:278)
        at
org.apache.cassandra.service.ClientState$1.load(ClientState.java:275)
        at
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
        at
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
        at
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
        at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
        ... 20 more
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
Operation timed out - received only 0 responses.
        at
org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105)
        at
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930)
        at
org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815)
        at
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140)
        at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245)
        ... 29 more

These exception stopped to appear at 2014-08-01 08:59:17,48 (and did not
reapper after that), and cassandra seemed  to be happy as it marked the
other two nodes as up for a while. But after a few tens of seconds, it
marked them as DOWN again and kept it in this state until restart three
days later, even the network connectivity was stable:

 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
InetAddress /77.234.42.20 is now UP
 INFO [HintedHandoff:2] 2014-08-01 09:02:37,305
HintedHandOffManager.java (line 296) Started hinted handoff for host:
9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /77.234.42.20
 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
InetAddress /77.234.44.20 is now UP
 INFO [HintedHandoff:1] 2014-08-01 09:02:37,308
HintedHandOffManager.java (line 296) Started hinted handoff for host:
97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /77.234.44.20
 INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java
(line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968
serialized/live bytes, 32 ops)
 INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line
398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live
bytes, 32 ops)
 INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line
443) Completed flushing; nothing needed to be retained.  Commitlog
position was ReplayPosition(segmentId=1403712545417, position=20482758)
 INFO [HintedHandoff:1] 2014-08-01 09:02:47,312
HintedHandOffManager.java (line 427) Timed out replaying hints to
/77.234.44.20; aborting (0 delivered)
 INFO [HintedHandoff:2] 2014-08-01 09:02:47,312
HintedHandOffManager.java (line 427) Timed out replaying hints to
/77.234.42.20; aborting (0 delivered)
 INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823)
InetAddress /77.234.44.20 is now DOWN
 INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823)
InetAddress mia10.ff.avast.com/77.234.42.20 is now DOWN

After one day, the node started to discards hints it saved for the other
nodes.

 WARN [OptionalTasks:1] 2014-08-02 09:09:10,277
HintedHandoffMetrics.java (line 78) /77.234.44.20 has 9039 dropped
hints, because node is down past configured hint window.
 WARN [OptionalTasks:1] 2014-08-02 09:09:10,290
HintedHandoffMetrics.java (line 78) /77.234.42.20 has 8816 dropped
hints, because node is down past configured hint window.


The "nodetool status" marked the other two nodes as down. After restart,
it reconnected to the cluster without any problem.

What puzzles me is the fact that the authentization apparently started
to work after the network recovered but the exchange of data did not.

I would like to understand what could caused the problems and how to
avoid  them in the future.

Any pointers would be appreciated.

Regards
Jiri Horky


Re: Fail to reconnect to other nodes after intermittent network failure

Posted by Jiri Horky <ho...@avast.com>.
OK, ticket 7696 [1] created.

Jiri Horky

https://issues.apache.org/jira/browse/CASSANDRA-7696

On 08/05/2014 07:57 PM, Robert Coli wrote:
>
> On Tue, Aug 5, 2014 at 5:48 AM, Jiri Horky <horky@avast.com
> <ma...@avast.com>> wrote:
>
>     What puzzles me is the fact that the authentization apparently started
>     to work after the network recovered but the exchange of data did not.
>
>     I would like to understand what could caused the problems and how to
>     avoid  them in the future.
>
>
> Very few people use SSL and very few people use auth, you have
> probably hit an edge case.
>
> I would file a JIRA with the details you described above.
>
> =Rob
>  


Re: Fail to reconnect to other nodes after intermittent network failure

Posted by Robert Coli <rc...@eventbrite.com>.
On Tue, Aug 5, 2014 at 5:48 AM, Jiri Horky <ho...@avast.com> wrote:

> What puzzles me is the fact that the authentization apparently started
> to work after the network recovered but the exchange of data did not.
>
> I would like to understand what could caused the problems and how to
> avoid  them in the future.
>

Very few people use SSL and very few people use auth, you have probably hit
an edge case.

I would file a JIRA with the details you described above.

=Rob