Posted to commits@cassandra.apache.org by "Jeff Jirsa (JIRA)" <ji...@apache.org> on 2016/03/10 20:34:40 UTC
[jira] [Comment Edited] (CASSANDRA-11340) Speculative retry on system_auth tables can cause deadlock

    [ https://issues.apache.org/jira/browse/CASSANDRA-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189820#comment-15189820 ] 

Jeff Jirsa edited comment on CASSANDRA-11340 at 3/10/16 7:33 PM:
-----------------------------------------------------------------

Right now, {{system_auth}} is protected against table-level modifications, so an end user won't be able to do that by default (though it's a trivial rebuild to remove those protections). That may be the easiest fix, but it may not be the 'right' fix, as it will punish users who ignore the RF=N guidance.



was (Author: jjirsa):
Right now, `system_auth` is protected against table level modifications, so an end user won't be able to do that by default (though it's a trivial rebuild to remove those protections). That may be the easiest fix, but may not be the 'right' fix, as it will punish users who ignore the RF=N guidance.


> Speculative retry on system_auth tables can cause deadlock
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-11340
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11340
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jeff Jirsa
>
> Reproduced in at least 2.1.9. 
> It appears possible for queries against system_auth tables to trigger speculative retry, which causes auth to block on traffic going off node. In some cases, threads appear to become deadlocked, causing load on the nodes to increase sharply. This happens even in clusters where the RF of system_auth == N: because all requests are served locally, the latency bar for the 99th-percentile speculative retry (SR) threshold is pretty low. 
> Incomplete stack trace below, but we haven't yet figured out what exactly is blocking:
> {code}
> Thread 82291: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
>  - org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUntil(long) @bci=28, line=307 (Compiled frame)
>  - org.apache.cassandra.utils.concurrent.SimpleCondition.await(long, java.util.concurrent.TimeUnit) @bci=76, line=63 (Compiled frame)
>  - org.apache.cassandra.service.ReadCallback.await(long, java.util.concurrent.TimeUnit) @bci=25, line=92 (Compiled frame)
>  - org.apache.cassandra.service.AbstractReadExecutor$SpeculatingReadExecutor.maybeTryAdditionalReplicas() @bci=39, line=281 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.fetchRows(java.util.List, org.apache.cassandra.db.ConsistencyLevel) @bci=175, line=1338 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.readRegular(java.util.List, org.apache.cassandra.db.ConsistencyLevel) @bci=9, line=1274 (Compiled frame)
>  - org.apache.cassandra.service.StorageProxy.read(java.util.List, org.apache.cassandra.db.ConsistencyLevel, org.apache.cassandra.service.ClientState) @bci=57, line=1199 (Compiled frame)
>  - org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.pager.Pageable, org.apache.cassandra.cql3.QueryOptions, int, long, org.apache.cassandra.service.QueryState) @bci=35, line=272 (Compiled frame)
>  - org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.QueryState, org.apache.cassandra.cql3.QueryOptions) @bci=105, line=224 (Compiled frame)
>  - org.apache.cassandra.auth.Auth.selectUser(java.lang.String) @bci=27, line=265 (Compiled frame)
>  - org.apache.cassandra.auth.Auth.isExistingUser(java.lang.String) @bci=1, line=86 (Compiled frame)
>  - org.apache.cassandra.service.ClientState.login(org.apache.cassandra.auth.AuthenticatedUser) @bci=11, line=206 (Compiled frame)
>  - org.apache.cassandra.transport.messages.AuthResponse.execute(org.apache.cassandra.service.QueryState) @bci=58, line=82 (Compiled frame)
>  - org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext, org.apache.cassandra.transport.Message$Request) @bci=75, line=439 (Compiled frame)
>  - org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext, java.lang.Object) @bci=6, line=335 (Compiled frame)
>  - io.netty.channel.SimpleChannelInboundHandler.channelRead(io.netty.channel.ChannelHandlerContext, java.lang.Object) @bci=17, line=105 (Compiled frame)
>  - io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(java.lang.Object) @bci=9, line=333 (Compiled frame)
>  - io.netty.channel.AbstractChannelHandlerContext.access$700(io.netty.channel.AbstractChannelHandlerContext, java.lang.Object) @bci=2, line=32 (Compiled frame)
>  - io.netty.channel.AbstractChannelHandlerContext$8.run() @bci=8, line=324 (Compiled frame)
>  - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
>  - org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run() @bci=5, line=164 (Compiled frame)
>  - org.apache.cassandra.concurrent.SEPWorker.run() @bci=87, line=105 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> {code}
> In a cluster with many connected clients (potentially thousands), a reconnection flood (for example, restarting all at once) is likely to trigger this bug. However, it is unlikely to be seen in normal operation. 
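
To illustrate the point about the low 99th-percentile bar, here is a minimal sketch. It is not Cassandra's implementation: the class name, method names, and latency figures are all hypothetical. It assumes a nearest-rank percentile over recorded read latencies, and a coordinator that speculates once a read has waited longer than that percentile. When nearly every read is served locally and fast, the threshold sits at the fast local latency, so even a modest stall would cross it and send a request off node.

```java
import java.util.Arrays;

// Illustrative only: NOT Cassandra code. All names and numbers are hypothetical.
public class SpeculativeRetrySketch {

    // Nearest-rank percentile (integer pct in 1..100) over latency samples in microseconds.
    static long percentile(long[] latenciesMicros, int pct) {
        long[] sorted = latenciesMicros.clone();
        Arrays.sort(sorted);
        // Integer ceiling of pct * n / 100, avoiding floating-point rounding.
        int rank = (pct * sorted.length + 99) / 100;
        return sorted[Math.max(0, rank - 1)];
    }

    // Assumed policy: speculate when the read has waited longer than the threshold.
    static boolean wouldSpeculate(long waitedMicros, long thresholdMicros) {
        return waitedMicros > thresholdMicros;
    }

    // Hypothetical workload: 995 fast local reads (200 us) and 5 slow ones (5000 us).
    static long[] mostlyLocalSamples() {
        long[] samples = new long[1000];
        Arrays.fill(samples, 200L);
        for (int i = 995; i < 1000; i++) {
            samples[i] = 5000L;
        }
        return samples;
    }

    public static void main(String[] args) {
        long threshold = percentile(mostlyLocalSamples(), 99);
        System.out.println("p99 threshold (us): " + threshold);
        // A read stalled for 1 ms is far above the local-read threshold.
        System.out.println("speculate after 1000 us wait: " + wouldSpeculate(1000L, threshold));
    }
}
```

Under these assumed numbers the threshold equals the fast local latency, so a login read that stalls briefly (for example during a reconnection flood) would already qualify for a speculative request to another replica.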


