You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Jiri Horky (JIRA)" <ji...@apache.org> on 2014/08/05 21:48:12 UTC
[jira] [Created] (CASSANDRA-7696) Fail to reconnect to other nodes after intermittent network failure (SSL + auth enabled)

Jiri Horky created CASSANDRA-7696:
-------------------------------------

             Summary: Fail to reconnect to other nodes after intermittent network failure (SSL + auth enabled)
                 Key: CASSANDRA-7696
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7696
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: Cassandra 1.2.16, CentOS 6.5, Cluster spread to 3 datacenters with 1 node each. SSL enabled between Cassandra nodes and also to clients, authentization enabled.
            Reporter: Jiri Horky


We experienced a strange problem after intermittent network failure when
the affected node did not reconnect to the rest of the cluster but did
allow to autenticate users (which was not possible during the actual
network outage, see below). The cluster consists of 1 node in each of 3
datacenters, it uses C* 1.2.16 with SSL enabled both to clients and
between C* nodes. The authentication is enabled as well.

The problem started around 2014-08-01 when Cassandra first noticed a
network problem:

 INFO [GossipTasks:1] 2014-08-01 07:47:52,618 Gossiper.java (line 823)
InetAddress /A.B.D.20 is now DOWN
 INFO [GossipTasks:1] 2014-08-01 07:47:55,619 Gossiper.java (line 823)
InetAddress mia10/A.B.C.20 is now DOWN

The network came up for a while:

INFO [GossipStage:1] 2014-08-01 07:51:29,380 Gossiper.java (line 809)
InetAddress /A.B.C.20 is now UP
 INFO [HintedHandoff:1] 2014-08-01 07:51:29,381
HintedHandOffManager.java (line 296) Started hinted handoff for host:
9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20
 INFO [GossipStage:1] 2014-08-01 07:51:29,381 Gossiper.java (line 809)
InetAddress /A.B.D.20 is now UP
 INFO [HintedHandoff:2] 2014-08-01 07:51:29,385
HintedHandOffManager.java (line 296) Started hinted handoff for host:
97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20

But it failed to send hints:


 INFO [HintedHandoff:1] 2014-08-01 07:51:39,389
HintedHandOffManager.java (line 427) Timed out replaying hints to
/A.B.C.20; aborting (0 delivered)
 INFO [HintedHandoff:2] 2014-08-01 07:51:39,390
HintedHandOffManager.java (line 427) Timed out replaying hints to
/A.B.D.20; aborting (0 delivered)

Also, the log started to be "flooded" with failed autentication tries.
My understanding is that authentication data are read with QUORUM which
failed as the other two nodes were down:


ERROR [Native-Transport-Requests:446116] 2014-08-01 07:51:39,985
QueryMessage.java (line 97) Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException:
java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
out - received only 0 responses.
        at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
        at
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
        at
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
        at
org.apache.cassandra.service.ClientState.authorize(ClientState.java:292)
        at
org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:172)
        at
org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:165)
        at
org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:149)
        at
org.apache.cassandra.cql3.statements.SelectStatement.checkAccess(SelectStatement.java:116)
        at
org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:102)
        at
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:113)
        at
org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:87)
        at
org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287)
        at
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at
org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
        at
org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
out - received only 0 responses.
        at org.apache.cassandra.auth.Auth.selectUser(Auth.java:256)
        at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:84)
        at
org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50)
        at
org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:68)
        at
org.apache.cassandra.service.ClientState$1.load(ClientState.java:278)
        at
org.apache.cassandra.service.ClientState$1.load(ClientState.java:275)
        at
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
        at
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
        at
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
        at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
        ... 20 more
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
Operation timed out - received only 0 responses.
        at
org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105)
        at
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930)
        at
org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815)
        at
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140)
        at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245)
        ... 29 more

These exception stopped to appear at 2014-08-01 08:59:17,48 (and did not
reapper after that), and cassandra seemed  to be happy as it marked the
other two nodes as up for a while. But after a few tens of seconds, it
marked them as DOWN again and kept it in this state until restart three
days later, even the network connectivity was stable:

 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
InetAddress /A.B.C.20 is now UP
 INFO [HintedHandoff:2] 2014-08-01 09:02:37,305
HintedHandOffManager.java (line 296) Started hinted handoff for host:
9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20
 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
InetAddress /A.B.D.20 is now UP
 INFO [HintedHandoff:1] 2014-08-01 09:02:37,308
HintedHandOffManager.java (line 296) Started hinted handoff for host:
97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20
 INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java
(line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968
serialized/live bytes, 32 ops)
 INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line
398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live
bytes, 32 ops)
 INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line
443) Completed flushing; nothing needed to be retained.  Commitlog
position was ReplayPosition(segmentId=1403712545417, position=20482758)
 INFO [HintedHandoff:1] 2014-08-01 09:02:47,312
HintedHandOffManager.java (line 427) Timed out replaying hints to
/A.B.D.20; aborting (0 delivered)
 INFO [HintedHandoff:2] 2014-08-01 09:02:47,312
HintedHandOffManager.java (line 427) Timed out replaying hints to
/A.B.C.20; aborting (0 delivered)
 INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823)
InetAddress /A.B.D.20 is now DOWN
 INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823)
InetAddress mia10/A.B.C.20 is now DOWN

After one day, the node started to discards hints it saved for the other
nodes.

 WARN [OptionalTasks:1] 2014-08-02 09:09:10,277
HintedHandoffMetrics.java (line 78) /A.B.D.20 has 9039 dropped
hints, because node is down past configured hint window.
 WARN [OptionalTasks:1] 2014-08-02 09:09:10,290
HintedHandoffMetrics.java (line 78) /A.B.C.20 has 8816 dropped
hints, because node is down past configured hint window.


The "nodetool status" marked the other two nodes as down. After restart,
it reconnected to the cluster without any problem.

What puzzles me is the fact that the authentization apparently started
to work after the network recovered but the exchange of data did not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)