You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Aleksey Yeschenko (JIRA)" <ji...@apache.org> on 2015/11/11 18:05:11 UTC

[jira] [Resolved] (CASSANDRA-7696) Fail to reconnect to other nodes after intermittent network failure (SSL + auth enabled)

     [ https://issues.apache.org/jira/browse/CASSANDRA-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Yeschenko resolved CASSANDRA-7696.
------------------------------------------
    Resolution: Cannot Reproduce

> Fail to reconnect to other nodes after intermittent network failure (SSL + auth enabled)
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7696
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7696
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 1.2.16, CentOS 6.5, Cluster spread to 3 datacenters with 1 node each. SSL enabled between Cassandra nodes and also to clients, authentization enabled.
>            Reporter: Jiri Horky
>
> We experienced a strange problem after intermittent network failure when
> the affected node did not reconnect to the rest of the cluster but did
> allow to autenticate users (which was not possible during the actual
> network outage, see below). The cluster consists of 1 node in each of 3
> datacenters, it uses C* 1.2.16 with SSL enabled both to clients and
> between C* nodes. The authentication is enabled as well.
> The problem started around 2014-08-01 when Cassandra first noticed a
> network problem:
>  INFO [GossipTasks:1] 2014-08-01 07:47:52,618 Gossiper.java (line 823)
> InetAddress /A.B.D.20 is now DOWN
>  INFO [GossipTasks:1] 2014-08-01 07:47:55,619 Gossiper.java (line 823)
> InetAddress mia10/A.B.C.20 is now DOWN
> The network came up for a while:
> INFO [GossipStage:1] 2014-08-01 07:51:29,380 Gossiper.java (line 809)
> InetAddress /A.B.C.20 is now UP
>  INFO [HintedHandoff:1] 2014-08-01 07:51:29,381
> HintedHandOffManager.java (line 296) Started hinted handoff for host:
> 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20
>  INFO [GossipStage:1] 2014-08-01 07:51:29,381 Gossiper.java (line 809)
> InetAddress /A.B.D.20 is now UP
>  INFO [HintedHandoff:2] 2014-08-01 07:51:29,385
> HintedHandOffManager.java (line 296) Started hinted handoff for host:
> 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20
> But it failed to send hints:
>  INFO [HintedHandoff:1] 2014-08-01 07:51:39,389
> HintedHandOffManager.java (line 427) Timed out replaying hints to
> /A.B.C.20; aborting (0 delivered)
>  INFO [HintedHandoff:2] 2014-08-01 07:51:39,390
> HintedHandOffManager.java (line 427) Timed out replaying hints to
> /A.B.D.20; aborting (0 delivered)
> Also, the log started to be "flooded" with failed autentication tries.
> My understanding is that authentication data are read with QUORUM which
> failed as the other two nodes were down:
> ERROR [Native-Transport-Requests:446116] 2014-08-01 07:51:39,985
> QueryMessage.java (line 97) Unexpected error during query
> com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.RuntimeException:
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
> out - received only 0 responses.
>         at
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
>         at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
>         at
> com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
>         at
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
>         at
> org.apache.cassandra.service.ClientState.authorize(ClientState.java:292)
>         at
> org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:172)
>         at
> org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:165)
>         at
> org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:149)
>         at
> org.apache.cassandra.cql3.statements.SelectStatement.checkAccess(SelectStatement.java:116)
>         at
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:102)
>         at
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:113)
>         at
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:87)
>         at
> org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287)
>         at
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at
> org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
>         at
> org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: java.lang.RuntimeException:
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
> out - received only 0 responses.
>         at org.apache.cassandra.auth.Auth.selectUser(Auth.java:256)
>         at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:84)
>         at
> org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50)
>         at
> org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:68)
>         at
> org.apache.cassandra.service.ClientState$1.load(ClientState.java:278)
>         at
> org.apache.cassandra.service.ClientState$1.load(ClientState.java:275)
>         at
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
>         at
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
>         at
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
>         at
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
>         ... 20 more
> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
> Operation timed out - received only 0 responses.
>         at
> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105)
>         at
> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930)
>         at
> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815)
>         at
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140)
>         at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245)
>         ... 29 more
> These exception stopped to appear at 2014-08-01 08:59:17,48 (and did not
> reapper after that), and cassandra seemed  to be happy as it marked the
> other two nodes as up for a while. But after a few tens of seconds, it
> marked them as DOWN again and kept it in this state until restart three
> days later, even the network connectivity was stable:
>  INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
> InetAddress /A.B.C.20 is now UP
>  INFO [HintedHandoff:2] 2014-08-01 09:02:37,305
> HintedHandOffManager.java (line 296) Started hinted handoff for host:
> 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20
>  INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
> InetAddress /A.B.D.20 is now UP
>  INFO [HintedHandoff:1] 2014-08-01 09:02:37,308
> HintedHandOffManager.java (line 296) Started hinted handoff for host:
> 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20
>  INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java
> (line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968
> serialized/live bytes, 32 ops)
>  INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line
> 398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live
> bytes, 32 ops)
>  INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line
> 443) Completed flushing; nothing needed to be retained.  Commitlog
> position was ReplayPosition(segmentId=1403712545417, position=20482758)
>  INFO [HintedHandoff:1] 2014-08-01 09:02:47,312
> HintedHandOffManager.java (line 427) Timed out replaying hints to
> /A.B.D.20; aborting (0 delivered)
>  INFO [HintedHandoff:2] 2014-08-01 09:02:47,312
> HintedHandOffManager.java (line 427) Timed out replaying hints to
> /A.B.C.20; aborting (0 delivered)
>  INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823)
> InetAddress /A.B.D.20 is now DOWN
>  INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823)
> InetAddress mia10/A.B.C.20 is now DOWN
> After one day, the node started to discards hints it saved for the other
> nodes.
>  WARN [OptionalTasks:1] 2014-08-02 09:09:10,277
> HintedHandoffMetrics.java (line 78) /A.B.D.20 has 9039 dropped
> hints, because node is down past configured hint window.
>  WARN [OptionalTasks:1] 2014-08-02 09:09:10,290
> HintedHandoffMetrics.java (line 78) /A.B.C.20 has 8816 dropped
> hints, because node is down past configured hint window.
> The "nodetool status" marked the other two nodes as down. After restart,
> it reconnected to the cluster without any problem.
> What puzzles me is the fact that the authentization apparently started
> to work after the network recovered but the exchange of data did not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)