Posted to user@cassandra.apache.org by Xiangfei Ni <xi...@cm-dt.com> on 2018/03/27 02:56:08 UTC

A node down every day in a 6 nodes cluster

Hi Cassandra experts,
  I am facing an issue: one node goes down every day in a 6-node cluster. The cluster is in a single DC.
  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Each node holds about 200 GB of data, and the RF for the business CF is 3. A node goes down about once a day, and system.log shows the following:
WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
        ... 26 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]
        ... 32 common frames omitted
WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
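
For readers triaging a similar log: the trace above shows that each "failed to authorize" warning is caused by a ReadTimeoutException while the permissions were being loaded, not by a missing grant. A rough way to see how often this happens, and for which tables, is to tally the WARN lines. The sketch below is only an illustration in Python; the function name `count_auth_failures` and the `SAMPLE_LOG` list are hypothetical, and the regular expression assumes the exact line layout shown in this message.

```python
import re
from collections import Counter

# Matches the WARN line written by CassandraAuthorizer, as it appears in the log above.
# The layout is an assumption taken from this one log excerpt.
AUTH_FAIL = re.compile(
    r"^WARN\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"CassandraAuthorizer\.java:\d+ - CassandraAuthorizer failed to authorize "
    r"#<User (?P<user>[^>]+)> for <table (?P<table>[^>]+)>"
)


def count_auth_failures(lines):
    """Count authorization failures per (user, table) pair."""
    counts = Counter()
    for line in lines:
        match = AUTH_FAIL.match(line)
        if match:
            counts[(match.group("user"), match.group("table"))] += 1
    return counts


# Three lines copied from the log excerpt above (hypothetical variable name).
SAMPLE_LOG = [
    "WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - "
    "CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>",
    "ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - "
    "Unexpected error during query",
    "WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - "
    "CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>",
]

for (user, table), n in sorted(count_auth_failures(SAMPLE_LOG).items()):
    print(f"{user} -> {table}: {n} failure(s)")
```

In practice the lines would come from iterating over the open system.log file rather than from a list. A burst of failures clustered around the time the node goes down would suggest the timeouts are a symptom of the node being unresponsive rather than a permissions problem.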

I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:
cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
 nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

The cache disk can be read and written normally.

Any help would be highly appreciated. Thanks very much!


Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516


Re: Re: A node down every day in a 6 nodes cluster

Posted by Xiangfei Ni <xi...@cm-dt.com>.
I have checked dmesg and the messages log. There are no eth* entries in either, so I think there was no network connection issue.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: daemeon reiydelle <da...@gmail.com>
Sent: March 27, 2018 11:42
To: user <us...@cassandra.apache.org>
Subject: Re: Re: A node down every day in a 6 nodes cluster

Look for errors on your network interface. I think you have periodic errors in your network connectivity


<======>
"Who do you think made the first stone spear? The Asperger guy.
If you get rid of the autism genetics, there would be no Silicon Valley"
Temple Grandin
Daemeon C.M. Reiydelle
San Francisco 1.415.501.0198
London 44 020 8144 9872

On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Jeff,
    I need to restart the node manually every time; only one node has this problem.
    I have attached the nodetool output. Thanks.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

That warning isn’t sufficient to understand why the node is going down


Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good idea

Are the nodes coming up on their own? Or are you restarting them?

Paste the output of nodetool tpstats and nodetool cfstats



--
Jeff Jirsa


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


Re: Re: A node down every day in a 6 nodes cluster

Posted by daemeon reiydelle <da...@gmail.com>.
Look for errors on your network interface. I think you have periodic errors
in your network connectivity
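
One way to act on this suggestion on a Linux host is to read the kernel's per-interface error counters directly, since a quiet dmesg does not necessarily mean the counters are zero. The following is a minimal sketch in Python, assuming a Linux system that exposes `/sys/class/net/<interface>/statistics/`; the function name `interface_errors` and the choice of four counters are illustrative, not something taken from this thread.

```python
from pathlib import Path

# Counter files read from each interface's statistics directory.
# Which counters matter is a judgment call; these four are an assumption.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def interface_errors(sys_net="/sys/class/net"):
    """Return {interface: {counter: value}} for interfaces with a non-zero counter."""
    result = {}
    for iface in sorted(Path(sys_net).iterdir()):
        stats_dir = iface / "statistics"
        if not stats_dir.is_dir():
            # Skip entries that are not network interfaces.
            continue
        values = {}
        for name in COUNTERS:
            counter_file = stats_dir / name
            if counter_file.is_file():
                values[name] = int(counter_file.read_text().strip())
        if any(values.values()):
            result[iface.name] = values
    return result
```

Calling `interface_errors()` on the affected node and on a healthy node, then comparing the two results, would show whether the problem node is accumulating errors or drops that the others are not. The counters are cumulative since boot, so sampling twice some minutes apart is more telling than a single reading.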


<======>
"Who do you think made the first stone spear? The Asperger guy.
If you get rid of the autism genetics, there would be no Silicon Valley"
Temple Grandin


Daemeon C.M. Reiydelle
San Francisco 1.415.501.0198
London 44 020 8144 9872



Re: Re: A node down every day in a 6 nodes cluster

Posted by Xiangfei Ni <xi...@cm-dt.com>.
Hi Jeff,
It is very strange: the table info shows that the max partition is 129557750 bytes (123M):

[inline image image001.png, not included in this plain-text version]
But why is the max partition of the index nearly 2G?
             Table (index): rt_ac_stat.idx_rt_ac_stat_prot_verrt_ac_stat.idx_rt_ac_stat_prot_ver
                SSTable count: 7
                Space used (live): 1049948206
                Space used (total): 1049948206
                Space used by snapshots (total): 0
                Off heap memory used (total): 377947
                SSTable Compression Ratio: 0.3381407723053012
                Number of keys (estimate): 2
                Memtable cell count: 9435
                Memtable data size: 429878
                Memtable off heap memory used: 0
                Memtable switch count: 0
                Local read count: 0
                Local read latency: NaN ms
                Local write count: 212512
                Local write latency: 0.052 ms
                Pending flushes: 0
                Percent repaired: 0.0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 112
                Bloom filter off heap memory used: 56
                Index summary off heap memory used: 91
                Compression metadata off heap memory used: 377800
                Compacted partition minimum bytes: 785940
                Compacted partition maximum bytes: 1996099046
                Compacted partition mean bytes: 495191984
                Average live cells per slice (last five minutes): NaN
                Maximum live cells per slice (last five minutes): 0
                Average tombstones per slice (last five minutes): NaN
                Maximum tombstones per slice (last five minutes): 0
                Dropped Mutations: 0
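
To spot entries like this one across a full cfstats dump, the output can be scanned for large "Compacted partition maximum bytes" values. The sketch below is a hedged illustration in Python: the function name `wide_partitions` is hypothetical, the 100 MiB default threshold is an assumed rule of thumb rather than a figure from this thread, and the parsing relies only on the line labels visible in the output above.

```python
import re


def wide_partitions(cfstats_text, threshold_bytes=100 * 1024 * 1024):
    """Return {table_name: max_partition_bytes} for tables above the threshold."""
    flagged = {}
    current = None
    for raw in cfstats_text.splitlines():
        line = raw.strip()
        # A "Table: name" or "Table (index): name" line starts a new section.
        header = re.match(r"Table(?: \(index\))?: (\S+)", line)
        if header:
            current = header.group(1)
            continue
        size_line = re.match(r"Compacted partition maximum bytes: (\d+)", line)
        if size_line and current is not None:
            size = int(size_line.group(1))
            if size > threshold_bytes:
                flagged[current] = size
    return flagged


# Excerpt of the index statistics quoted in this message.
SAMPLE = """\
Table (index): rt_ac_stat.idx_rt_ac_stat_prot_verrt_ac_stat.idx_rt_ac_stat_prot_ver
    SSTable count: 7
    Compacted partition minimum bytes: 785940
    Compacted partition maximum bytes: 1996099046
    Compacted partition mean bytes: 495191984
"""

for name, size in wide_partitions(SAMPLE).items():
    print(f"{name}: {size / 1024 ** 3:.2f} GiB")
```

With "Number of keys (estimate): 2" and a mean partition size of roughly 495 MB, the index holds almost all of its data in a couple of partitions, which is consistent with the indexed column having very few distinct values.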

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xi...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

Only one node having the problem is suspicious. It may be that your application is improperly pooling connections, or you have a hardware problem.

I don't see anything in the nodetool output that explains it, though you certainly have a data model likely to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_verrt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions, and it'll be difficult to read).





On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Jeff,
    I need to restart the node manually every time. Only one node has this problem.
    I have attached the nodetool output, thanks.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

That warning isn’t sufficient to understand why the node is going down


Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good idea

Are the nodes coming up on their own? Or are you restarting them?

Paste the output of nodetool tpstats and nodetool cfstats



--
Jeff Jirsa


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Cassandra experts,
  I am facing an issue: a node goes down every day in a 6-node cluster. The cluster is in a single DC.
  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Every node holds about 200 GB of data, and the RF for the business CF is 3. A node goes down once every day, and the system.log shows the following:
WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
        ... 26 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]
        ... 32 common frames omitted
WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:
cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

The cache disk can be read and written normally.

Any help is highly appreciated, thanks very much!


Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516



RE: Re: Re: A node down every day in a 6 nodes cluster

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Properly Sizing Your Heap to Prevent OutOfMemoryErrors

https://support.datastax.com/hc/en-us/articles/204225929-Properly-Sizing-Your-Heap-to-Prevent-OutOfMemoryErrors
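
As a point of comparison for the MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m settings reported at the start of this thread, the Python sketch below reproduces the automatic sizing rule as it is commonly described for the stock cassandra-env.sh of the 3.x line: heap = max(min(RAM/2, 1024 MB), min(RAM/4, 8192 MB)), young generation = min(100 MB per core, heap/4). Treat the formula as an assumption to verify against your own cassandra-env.sh; the function name is hypothetical.

```python
from typing import Tuple


def default_heap_mb(system_memory_mb: int, cpu_cores: int) -> Tuple[int, int]:
    """Return (max_heap_mb, young_gen_mb) under the assumed default rule."""
    # Half of RAM, capped at 1024 MB.
    half_ram_capped = min(system_memory_mb // 2, 1024)
    # A quarter of RAM, capped at 8192 MB.
    quarter_ram_capped = min(system_memory_mb // 4, 8192)
    # The larger of the two candidates becomes the max heap.
    max_heap = max(half_ram_capped, quarter_ram_capped)
    # Young generation: 100 MB per core, but no more than a quarter of the heap.
    young_gen = min(100 * cpu_cores, max_heap // 4)
    return max_heap, young_gen


if __name__ == "__main__":
    # 16 GB of RAM and 4 cores, as described in the original post.
    heap, young = default_heap_mb(16 * 1024, 4)
    print(f"assumed default: MAX_HEAP_SIZE={heap}m HEAP_NEWSIZE={young}m")
```

For the 16 GB, 4-core nodes in this thread, the rule yields a 4096 MB heap and a 400 MB young generation, both smaller than the configured values. The thread does not establish whether that difference matters here.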

 

 

From: Kenneth Brotman [mailto:kenbrotman@yahoo.com.INVALID] 
Sent: Wednesday, March 28, 2018 5:35 AM
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

If you think that will fix the problem, maybe you could add a little more memory to each machine as a short term fix.

 

From: Xiangfei Ni [mailto:xiangfei.ni@cm-dt.com] 
Sent: Wednesday, March 28, 2018 5:24 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

Yes, we discussed it and plan to figure out the data model issue and upgrade to version 3.11.3.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Kenneth Brotman <ke...@yahoo.com.INVALID>
Sent: March 28, 2018 20:16
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

David, 

 

Did you figure out what to do about the data model problem?  It could be that your data files finally grew to the point where the data model problem caused the Java heap space issue, in which case everything is actually working as it’s supposed to; you just have to fix the data model.

 

Kenneth Brotman

 

 

From: Kenneth Brotman [mailto:kenbrotman@yahoo.com]
Sent: Wednesday, March 28, 2018 4:46 AM
To: 'user@cassandra.apache.org'
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

Was any change to hardware done around the time the problem started?

Was any change to the client software done around the time the problem started?

Was any change to the database schema done around the time the problem started?

 

Kenneth Brotman

 

From: Xiangfei Ni [mailto:xiangfei.ni@cm-dt.com]
Sent: Wednesday, March 28, 2018 4:40 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

Hi Kenneth,

    The cluster has been running for 4 months.

    The problem started last week.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Kenneth Brotman <kenbrotman@yahoo.com.INVALID>
Sent: March 28, 2018 19:34
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

David,

 

How long has the cluster been operating?

How long has the problem been occurring?

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jjirsa@gmail.com]
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

 

java.lang.OutOfMemoryError: Java heap space

You’re OOM’ing.

 

-- 

Jeff Jirsa
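
The OutOfMemoryError line quoted in this reply is the kind of entry worth searching a node's log for. A minimal Python sketch, assuming a plain-text log read line by line; the two sample lines in the usage are made up for illustration (they are not from the attached log), and the function name is hypothetical:

```python
import re
from typing import Iterable, List, Tuple

# Matches the JVM error name and, when present, the detail after the colon.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(?P<detail>.*))?")


def find_oom_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, detail) for every line mentioning an OutOfMemoryError."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((lineno, (match.group("detail") or "").strip()))
    return hits


if __name__ == "__main__":
    # Synthetic sample lines, for illustration only.
    sample = [
        "INFO  [main] illustrative line without an error",
        "ERROR [worker-1] illustrative line: java.lang.OutOfMemoryError: Java heap space",
    ]
    for lineno, detail in find_oom_lines(sample):
        print(f"line {lineno}: OutOfMemoryError ({detail})")
```

To scan a real file, pass an open file handle instead of the sample list, for example `find_oom_lines(open(path, encoding="utf-8", errors="replace"))`, where `path` is wherever your system.log lives.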

 


On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Jeff,

    Today another node shut down. I have attached the exception log file; could you please help analyze it? Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

<log.txt>


RE: Re: Re: A node down every day in a 6 nodes cluster

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
If you think that will fix the problem, maybe you could add a little more memory to each machine as a short term fix.

 

On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Cassandra experts,

  I am facing an issue: a node goes down every day in a 6-node cluster. The cluster is in a single DC.

  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Every node holds about 200 GB of data, and the RF for the business CF is 3. A node goes down once every day, and the system.log shows the following:

WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>

ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]

        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]

        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]

        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]

        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]

        ... 26 common frames omitted

Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]

        ... 32 common frames omitted

WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>

ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

 

I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:

cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 

role       | resource          | permissions

------------+-------------------+--------------------------------------------------------------

nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

 

The cache disk can be read and written normally.

Any help would be highly appreciated. Thanks very much!

 

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

 

<log.txt>


Re: Re: Re: A node down every day in a 6 nodes cluster

Posted by Xiangfei Ni <xi...@cm-dt.com>.
Yes, we discussed it and plan to figure out the data model issue and upgrade to version 3.11.3.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Kenneth Brotman <ke...@yahoo.com.INVALID>
Sent: March 28, 2018 20:16
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

David,

Did you figure out what to do about the data model problem? It could be that your data files finally grew to the point where the data model problem caused the Java heap space issue, in which case everything is actually working as it’s supposed to; you just have to fix the data model.

Kenneth Brotman


From: Kenneth Brotman [mailto:kenbrotman@yahoo.com]
Sent: Wednesday, March 28, 2018 4:46 AM
To: 'user@cassandra.apache.org'
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

Was any change made to the hardware around the time the problem started?
Was any change made to the client software around the time the problem started?
Was any change made to the database schema around the time the problem started?

Kenneth Brotman

From: Xiangfei Ni [mailto:xiangfei.ni@cm-dt.com]
Sent: Wednesday, March 28, 2018 4:40 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

Hi Kenneth,
    The cluster has been running for 4 months.
    The problem started last week.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Kenneth Brotman <ke...@yahoo.com.INVALID>
Sent: March 28, 2018 19:34
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

David,

How long has the cluster been operating?
How long has the problem been occurring?

Kenneth Brotman

From: Jeff Jirsa [mailto:jjirsa@gmail.com]
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

java.lang.OutOfMemoryError: Java heap space

You’re OOMing.

--
Jeff Jirsa
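
To locate out-of-memory events like the one quoted above, the log can be scanned for `java.lang.OutOfMemoryError` and each hit tied back to the log record it belongs to. The following is a minimal Python sketch, not the tooling anyone in this thread used. The function name `find_oom_events` is made up for illustration, the record pattern is inferred from the WARN/ERROR lines quoted in this thread, and the second sample record is hypothetical because the attached log is not reproduced here.

```python
import re

# Start of a log record, modeled on the lines quoted in this thread, e.g.
# "ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - ..."
RECORD = re.compile(
    r"^(TRACE|DEBUG|INFO|WARN|ERROR)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
)

def find_oom_events(lines):
    """Return (timestamp, thread) of the enclosing record for every line
    that mentions java.lang.OutOfMemoryError."""
    events = []
    current = None  # header of the most recent record seen so far
    for line in lines:
        match = RECORD.match(line)
        if match:
            current = (match.group("ts"), match.group("thread"))
        if "java.lang.OutOfMemoryError" in line:
            events.append(current)
    return events

sample = [
    # Taken from the original post
    "WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 "
    "CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize",
    # Hypothetical record, for illustration only
    "ERROR [ExampleThread-1] 2018-03-27 10:00:00,000 Example.java:1 - example",
    "java.lang.OutOfMemoryError: Java heap space",
]

for event in find_oom_events(sample):
    print(event)
```

In practice the lines would come from reading system.log. A record that mentions the error on more than one line is reported once per line, so the results may need de-duplicating.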


On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Jeff,
    Today another node shut down. I have attached the exception log file; could you please help analyze it? Thanks.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xi...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

Only one node having the problem is suspicious. It may be that your application is improperly pooling connections, or that you have a hardware problem.

I don't see anything in nodetool that explains it, though you certainly have a data model likely to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions, and it'll be difficult to read).
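
The cardinality point can be made concrete with a rough calculation. A Cassandra secondary index is kept on each node as a hidden table partitioned by the indexed value, so the fewer distinct values there are, the more entries land in each index partition. The sketch below assumes rows are spread evenly across values. The row count and bytes-per-entry figure are hypothetical, not numbers from this thread, and both function names are made up for illustration.

```python
def rows_per_index_partition(rows_on_node, distinct_indexed_values):
    """Average entries per index partition on one node, assuming rows are
    spread evenly across the distinct indexed values."""
    if distinct_indexed_values <= 0:
        raise ValueError("need at least one distinct value")
    return rows_on_node / distinct_indexed_values

def estimated_partition_bytes(entries, bytes_per_entry):
    """Very rough partition size: entry count times an assumed entry size."""
    return entries * bytes_per_entry

# Hypothetical inputs, for illustration only
ROWS_ON_NODE = 100_000_000
BYTES_PER_ENTRY = 50

for distinct in (3, 1_000, 1_000_000):
    entries = rows_per_index_partition(ROWS_ON_NODE, distinct)
    size_mb = estimated_partition_bytes(entries, BYTES_PER_ENTRY) / 1e6
    print(f"{distinct:>9} distinct values: ~{entries:,.0f} entries, "
          f"~{size_mb:,.1f} MB per index partition")
```

Because the estimate scales with the number of rows on the node, a low-cardinality index produces partitions that keep growing as data accumulates, which fits a cluster that ran for months before the problem appeared.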





On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Jeff,
    I need to restart the node manually every time; only one node has this problem.
    I have attached the nodetool output. Thanks.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

That warning isn’t sufficient to understand why the node is going down.


Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good idea

Are the nodes coming up on their own? Or are you restarting them?

Paste the output of nodetool tpstats and nodetool cfstats



--
Jeff Jirsa
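
Once the `nodetool cfstats` output is captured, the wide partitions can be picked out mechanically. The sketch below is an illustration only: it assumes the output contains `Table:` (or `Table (index):`) lines followed by a `Compacted partition maximum bytes:` line for each table, and it runs on a small made-up sample rather than the real output attached to this thread. The 100 MB default threshold mirrors Cassandra's large-partition compaction warning default; the function names are hypothetical.

```python
def max_partition_bytes_by_table(cfstats_text):
    """Map table name -> 'Compacted partition maximum bytes' value."""
    result = {}
    table = None
    for raw in cfstats_text.splitlines():
        line = raw.strip()
        if line.startswith("Table:") or line.startswith("Table (index):"):
            table = line.split(":", 1)[1].strip()
        elif line.startswith("Compacted partition maximum bytes:") and table:
            result[table] = int(line.split(":", 1)[1])
    return result

def wide_tables(cfstats_text, threshold_bytes=100 * 1024 * 1024):
    """Names of tables whose largest compacted partition exceeds the threshold."""
    sizes = max_partition_bytes_by_table(cfstats_text)
    return sorted(name for name, size in sizes.items() if size > threshold_bytes)

# Made-up sample in the assumed cfstats layout; names and sizes are hypothetical
SAMPLE = """\
Keyspace: example_ks
\t\tTable: small_table
\t\tCompacted partition maximum bytes: 1916
\t\tTable (index): big_table.example_idx
\t\tCompacted partition maximum bytes: 3449259151
"""

print(wide_tables(SAMPLE))
```

Real output could be saved with something like `nodetool cfstats > cfstats.txt` and the file contents passed to `wide_tables`.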


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Cassandra experts,
  I am facing an issue: a node goes down about once a day in a 6-node cluster. The cluster is in a single DC.
  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Every node holds about 200 GB of data, and the RF for the business CF is 3. The system.log shows the following:
WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
        ... 26 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]
        ... 32 common frames omitted
WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:
cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

The cache disk can be read and written normally.

Any help would be highly appreciated. Thanks very much!


Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516


<log.txt>

RE: Re: Re: A node down every day in a 6 nodes cluster

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
David, 

 

Did you figure out what to do about the data model problem? It could be that your data files finally grew to the point where the data model problem caused the Java heap space issue, in which case everything is actually working as it’s supposed to; you just have to fix the data model.

 

Kenneth Brotman

 

 



RE: Re: Re: A node down every day in a 6 nodes cluster

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Was any change made to the hardware around the time the problem started?

Was any change made to the client software around the time the problem started?

Was any change made to the database schema around the time the problem started?

 

Kenneth Brotman

 

From: Xiangfei Ni [mailto:xiangfei.ni@cm-dt.com] 
Sent: Wednesday, March 28, 2018 4:40 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

Hi Kenneth,

    The cluster has been running for 4 months.

    The problem started last week.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Kenneth Brotman <ke...@yahoo.com.INVALID>
Sent: March 28, 2018 19:34
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

David,

 

How long has the cluster been operating?

How long has the problem been occurring?

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jjirsa@gmail.com] 
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

java.lang.OutOfMemoryError: Java heap space

You’re OOMing.

 

-- 

Jeff Jirsa

 


On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Jeff,

    Today another node shut down. I have attached the exception log file; could you please help analyze it? Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xi...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

Only one node having the problem is suspicious. It may be that your application is improperly pooling connections, or that you have a hardware problem.

I don't see anything in nodetool that explains it, though you certainly have a data model likely to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions, and it'll be difficult to read).
 
 

 

On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Jeff,

    I need to restart the node manually every time; only one node has this problem.

    I have attached the nodetool output. Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

That warning isn’t sufficient to understand why the node is going down.

 

 

Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good idea

 

Are the nodes coming up on their own? Or are you restarting them?

 

Paste the output of nodetool tpstats and nodetool cfstats

 

 

 

-- 

Jeff Jirsa

 


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Cassandra experts,

  I am facing an issue: a node goes down every day in a 6-node cluster. The cluster is in a single DC.

  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Each node holds about 200 GB of data, and the RF for the business CF is 3. One node goes down once a day; the system.log shows the following:

WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>

ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]

        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]

        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]

        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]

        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]

        ... 26 common frames omitted

Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]

        ... 32 common frames omitted

WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>

ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

 

I have confirmed that nev_tsp_sa has all rights on the nev_prod_tsp keyspace:

cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 

role       | resource          | permissions

------------+-------------------+--------------------------------------------------------------

nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

 

The cache disk can be read and written normally.

 

Any help would be highly appreciated. Thanks very much!

 

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

 

<log.txt>


Re: Re: Re: A node down every day in a 6 nodes cluster

Posted by Xiangfei Ni <xi...@cm-dt.com>.
Hi Kenneth,
    The cluster has been running for 4 months.
    The problem started last week.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Kenneth Brotman <ke...@yahoo.com.INVALID>
Sent: March 28, 2018 19:34
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

David,

How long has the cluster been operating?
How long has the problem been occurring?

Kenneth Brotman

From: Jeff Jirsa [mailto:jjirsa@gmail.com]
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster




java.lang.OutOfMemoryError: Java heap space





You’re OOMing.

--
Jeff Jirsa


On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Jeff,
    Today another node shut down. I have attached the exception log file; could you please help analyze it? Thanks.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xi...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

Only one node having the problem is suspicious. It may be that your application is improperly pooling connections, or that you have a hardware problem.

I don't see anything in nodetool that explains it, though you certainly have a data model likely to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions and it'll be difficult to read).
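
Wide partitions of this kind can usually be spotted in the "Compacted partition maximum bytes" figure that nodetool cfstats reports per table. The following is only a rough Python sketch for scanning saved cfstats output. It assumes the usual "Keyspace : ...", "Table: ..." / "Table (index): ..." layout; the function name, the 100 MB threshold, and the sample keyspace, tables, and numbers are all made up for illustration and are not taken from this cluster.

```python
import re

# Hypothetical cut-off: treat partitions above ~100 MB as "wide".
WIDE_PARTITION_BYTES = 100 * 1024 * 1024


def find_wide_partitions(cfstats_text, threshold=WIDE_PARTITION_BYTES):
    """Return (keyspace, table, max_partition_bytes) for tables over threshold."""
    keyspace, table, wide = None, None, []
    for raw in cfstats_text.splitlines():
        line = raw.strip()
        m = re.match(r"Keyspace\s*:\s*(\S+)", line)
        if m:
            keyspace, table = m.group(1), None
            continue
        # Secondary indexes are assumed to appear as "Table (index): ...".
        m = re.match(r"Table(?: \(index\))?\s*:\s*(\S+)", line)
        if m:
            table = m.group(1)
            continue
        m = re.match(r"Compacted partition maximum bytes\s*:\s*(\d+)", line)
        if m and table is not None and int(m.group(1)) > threshold:
            wide.append((keyspace, table, int(m.group(1))))
    return wide


# Made-up sample, shaped like cfstats output; not real data from this thread.
SAMPLE_CFSTATS = """\
Keyspace : demo_ks
\t\tTable: events
\t\tCompacted partition maximum bytes: 50000
\t\tTable (index): events.idx_events_type
\t\tCompacted partition maximum bytes: 500000000
"""

if __name__ == "__main__":
    for ks, tbl, size in find_wide_partitions(SAMPLE_CFSTATS):
        print(f"{ks}.{tbl}: largest partition is {size} bytes")
```

A low-cardinality indexed column tends to concentrate many rows under a few index partitions, which is why an index table can stand out in this kind of scan.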





On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Jeff,
    I need to restart the node manually every time. Only one node has this problem.
    I have attached the nodetool output. Thanks.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

That warning isn’t sufficient to understand why the node is going down


Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good idea

Are the nodes coming up on their own? Or are you restarting them?

Paste the output of nodetool tpstats and nodetool cfstats
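
Both commands print plain text, so the output can be summarized before it is pasted. Here is a small Python sketch for tpstats, assuming the output consists of a "Pool Name" table (Active, Pending, Completed, Blocked, All time blocked) followed by a "Message type" table of dropped counts. The function names and the sample figures are hypothetical.

```python
def parse_tpstats(text):
    """Split saved `nodetool tpstats` output into pool stats and dropped counts."""
    pools, dropped, section = {}, {}, None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
        elif line.startswith("Message type"):
            section = "dropped"
        elif section == "pools" and len(parts) >= 6:
            try:
                active, pending, completed, blocked, all_blocked = map(int, parts[-5:])
            except ValueError:
                continue  # skip rows that do not carry five numeric columns
            pools[parts[0]] = {
                "active": active,
                "pending": pending,
                "completed": completed,
                "blocked": blocked,
                "all_time_blocked": all_blocked,
            }
        elif section == "dropped" and len(parts) >= 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return pools, dropped


def flag_pressure(pools, dropped):
    """Return pools with a backlog or past blocking, plus non-zero drop counts."""
    busy = sorted(
        name for name, s in pools.items()
        if s["pending"] > 0 or s["all_time_blocked"] > 0
    )
    return busy, {k: v for k, v in dropped.items() if v > 0}


# Made-up sample in the assumed layout; not output from this cluster.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         4        37        1000000         0                 0
MutationStage                     0         0        2000000         0                 0
Native-Transport-Requests         8         0        3000000         0               150

Message type           Dropped
READ                        12
MUTATION                     0
"""

if __name__ == "__main__":
    print(flag_pressure(*parse_tpstats(SAMPLE_TPSTATS)))
```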



--
Jeff Jirsa


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Cassandra experts,
  I am facing an issue: a node goes down every day in a 6-node cluster. The cluster is in a single DC.
  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Each node holds about 200 GB of data, and the RF for the business CF is 3. One node goes down once a day; the system.log shows the following:
WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
        ... 26 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]
        ... 32 common frames omitted
WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

I have confirmed that nev_tsp_sa has all rights on the nev_prod_tsp keyspace:
cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

The cache disk can be read and written normally.

Any help would be highly appreciated. Thanks very much!


Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516


<log.txt>

RE: Re: Re: A node down every day in a 6 nodes cluster

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
David,

 

How long has the cluster been operating?

How long has the problem been occurring?

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jjirsa@gmail.com] 
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

 

java.lang.OutOfMemoryError: Java heap space

 

 

You’re OOMing.

 

-- 

Jeff Jirsa

 


On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Jeff,

    Today another node shut down. I have attached the exception log file; could you please help analyze it? Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xi...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

 

Only one node having the problem is suspicious. It may be that your application is improperly pooling connections, or that you have a hardware problem.

 

I don't see anything in nodetool that explains it, though you certainly have a data model likely to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions and it'll be difficult to read).
 
 

 

On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Jeff,

    I need to restart the node manually every time. Only one node has this problem.

    I have attached the nodetool output. Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

 

That warning isn’t sufficient to understand why the node is going down

 

 

Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good idea

 

Are the nodes coming up on their own? Or are you restarting them?

 

Paste the output of nodetool tpstats and nodetool cfstats

 

 

 

-- 

Jeff Jirsa

 


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Cassandra experts,

  I am facing an issue: a node goes down every day in a 6-node cluster. The cluster is in a single DC.

  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Each node holds about 200 GB of data, and the RF for the business CF is 3. One node goes down once a day; the system.log shows the following:

WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>

ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]

        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]

        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]

        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]

        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]

        ... 26 common frames omitted

Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]

        ... 32 common frames omitted

WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>

ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

 

I have confirmed that nev_tsp_sa has all rights on the nev_prod_tsp keyspace:

cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 

role       | resource          | permissions

------------+-------------------+--------------------------------------------------------------

nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

 

The cache disk can be read and written normally.

 

Any help would be highly appreciated. Thanks very much!

 

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

 

<log.txt>


Re: Re: Re: A node down every day in a 6 nodes cluster

Posted by Jeff Jirsa <jj...@gmail.com>.
java.lang.OutOfMemoryError: Java heap space


You’re OOMing.
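
The error quoted above is the line to search for in system.log. Below is a minimal Python sketch for locating it, assuming the log line layout seen earlier in this thread (level, [thread], timestamp). The function name and the sample log lines are invented for illustration and are not from the attached log.txt.

```python
import re

# Matches the start of a log entry such as:
# "ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 ..."
LOG_PREFIX = re.compile(
    r"^(?:TRACE|DEBUG|INFO|WARN|ERROR)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
)


def find_oom_events(lines):
    """Return (timestamp, thread, text) for every OutOfMemoryError line.

    The timestamp and thread come from the nearest preceding log entry,
    because the exception itself is usually printed on a continuation line.
    """
    events, last_ts, last_thread = [], None, None
    for line in lines:
        m = LOG_PREFIX.match(line)
        if m:
            last_ts, last_thread = m.group("ts"), m.group("thread")
        if "java.lang.OutOfMemoryError" in line:
            events.append((last_ts, last_thread, line.strip()))
    return events


# Made-up sample lines in the assumed layout.
SAMPLE_LOG = """\
INFO  [ExampleThread-0] 2018-01-01 00:00:00,000 Example.java:1 - Example message
ERROR [ExampleThread-1] 2018-01-01 00:00:05,000 Example.java:2 - Example failure
java.lang.OutOfMemoryError: Java heap space
""".splitlines()

if __name__ == "__main__":
    for ts, thread, text in find_oom_events(SAMPLE_LOG):
        print(ts, thread, text)
```

To scan a real file, the same function can be fed an open file handle, since it only iterates over lines.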

-- 
Jeff Jirsa


> On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
> 
> Hi Jeff,
>     Today another node shut down. I have attached the exception log file; could you please help analyze it? Thanks.
>  
> Best Regards,
>  
> 倪项菲/ David Ni
> 中移德电网络科技有限公司
> Virtue Intelligent Network Ltd, co.
> 
> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
> Mob: +86 13797007811|Tel: + 86 27 5024 2516
>  
> From: Jeff Jirsa <jj...@gmail.com>
> Sent: March 27, 2018 11:50
> To: Xiangfei Ni <xi...@cm-dt.com>
> Cc: user@cassandra.apache.org
> Subject: Re: Re: A node down every day in a 6 nodes cluster
>  
> Only one node having the problem is suspicious. It may be that your application is improperly pooling connections, or that you have a hardware problem.
>  
> I don't see anything in nodetool that explains it, though you certainly have a data model likely to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions and it'll be difficult to read).
>  
>  
>  
> On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
> Hi Jeff,
>     I need to restart the node manually every time; only one node has this problem.
>     I have attached the nodetool output. Thanks.
>  
> Best Regards,
>  
> 倪项菲/ David Ni
> 中移德电网络科技有限公司
> Virtue Intelligent Network Ltd, co.
> 
> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
> Mob: +86 13797007811|Tel: + 86 27 5024 2516
>  
> From: Jeff Jirsa <jj...@gmail.com>
> Sent: March 27, 2018 11:03
> To: user@cassandra.apache.org
> Subject: Re: A node down every day in a 6 nodes cluster
>  
> That warning isn’t sufficient to understand why the node is going down
>  
>  
> Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good idea
>  
> Are the nodes coming up on their own? Or are you restarting them?
>  
> Paste the output of nodetool tpstats and nodetool cfstats
>  
>  
>  
> 
> -- 
> Jeff Jirsa
>  
> 
> On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
> 
> Hi Cassandra experts,
>   I am facing an issue: a node goes down every day in a 6-node cluster. The cluster is in a single DC.
>   Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Each node holds about 200 GB of data, and the RF for the business CF is 3. A node goes down once a day, and system.log shows the following:
> WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
> ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
> com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>         at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
>         at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
>         at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
>         at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
>         at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>         at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]
>         at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
>         ... 26 common frames omitted
> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>         at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]
>         ... 32 common frames omitted
> WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
> ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query
> com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>         at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
>  
> I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:
> cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';
>  
> role       | resource          | permissions
> ------------+-------------------+--------------------------------------------------------------
> nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}
>  
> The cache disk can be read and written normally.
>  
> Any help would be highly appreciated. Thanks very much!
>  
>  
> Best Regards,
>  
> 倪项菲/ David Ni
> 中移德电网络科技有限公司
> Virtue Intelligent Network Ltd, co.
> 
> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
> Mob: +86 13797007811|Tel: + 86 27 5024 2516
>  
>  
> <log.txt>

Re: Re: A node down every day in a 6 nodes cluster

Posted by Xiangfei Ni <xi...@cm-dt.com>.
Another thing: I have removed the index with the wide partitions, rt_ac_stat.idx_rt_ac_stat_prot_ver, as you pointed out.
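
For anyone wanting to confirm which tables or indexes carry wide partitions, the maximum partition size can be read out of nodetool cfstats output. The sketch below is illustrative only: it assumes the output contains "Table: <name>" (or "Table (index): <name>") lines followed by a "Compacted partition maximum bytes: <n>" line, the function names are hypothetical, the 100 MB threshold is an arbitrary rule of thumb, and the sample values are made up rather than taken from this cluster.

```python
import re
from typing import Dict, Iterable, List

# Assumed line shapes in `nodetool cfstats` output (may vary by version).
TABLE_RE = re.compile(r"^\s*Table(?: \(index\))?:\s*(\S+)")
MAX_BYTES_RE = re.compile(r"^\s*Compacted partition maximum bytes:\s*(\d+)")


def max_partition_bytes(lines: Iterable[str]) -> Dict[str, int]:
    """Map each table (or index) name to its reported maximum partition size."""
    sizes: Dict[str, int] = {}
    current = None
    for line in lines:
        table = TABLE_RE.match(line)
        if table:
            current = table.group(1)
            continue
        size = MAX_BYTES_RE.match(line)
        if size and current is not None:
            sizes[current] = int(size.group(1))
    return sizes


def wide_tables(sizes: Dict[str, int],
                threshold_bytes: int = 100 * 1024 * 1024) -> List[str]:
    """Return the names whose maximum partition exceeds the threshold, sorted."""
    return sorted(name for name, size in sizes.items() if size > threshold_bytes)


if __name__ == "__main__":
    # Hypothetical output fragment with made-up numbers.
    sample = [
        "\t\tTable: example_table",
        "\t\tCompacted partition maximum bytes: 4768",
        "\t\tTable (index): example_table.example_idx",
        "\t\tCompacted partition maximum bytes: 268650950",
    ]
    print(wide_tables(max_partition_bytes(sample)))
```

The same parsing can be applied to saved cfstats output from before and after an index is dropped to see whether the largest partitions are gone.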

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516



Re: Re: A node down every day in a 6 nodes cluster

Posted by Xiangfei Ni <xi...@cm-dt.com>.
Hi Jeff,
    Today another node shut down. I have attached the exception log file; could you please help analyze it? Thanks.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516




RE: Re: A node down every day in a 6 nodes cluster

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
David,

Can you replace the misbehaving node to see if that resolves the problem?

Kenneth Brotman


From: Xiangfei Ni [mailto:xiangfei.ni@cm-dt.com] 
Sent: Tuesday, March 27, 2018 3:27 AM
To: Jeff Jirsa
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

Thanks Jeff,

So your suggestion is to first resolve the data model issue that causes the wide partitions, right?

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

Hi Cassandra experts,

  I am facing an issue,a node downs every day in a 6 nodes cluster,the cluster is just in one DC,

  Every node has 4C 16G,and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m,every node load about 200G data,the RF for the business CF is 3,a node downs one time every day,the system.log shows below info:

WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>

ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]

        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]

        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]

        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]

        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]

        ... 26 common frames omitted

Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]

        ... 32 common frames omitted

WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>

ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

 

I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:

cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 

role       | resource          | permissions

------------+-------------------+--------------------------------------------------------------

nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

 

the cache disk can be read/write as normal.

 

Highly appreciated if anyone can help,thanks very much !

 

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811 <tel:+86%20137%209700%207811> |Tel: + 86 27 5024 2516 <tel:+86%2027%205024%202516> 

 

 


Re: Re: A node down every day in a 6 nodes cluster

Posted by Xiangfei Ni <xi...@cm-dt.com>.
Thanks Jeff,
So your suggestion is to first resolve the data model issue that causes the wide partitions, right?

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xi...@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

Only one node having the problem is suspicious. It may be that your application is improperly pooling connections, or that you have a hardware problem.

I don't see anything in nodetool that explains it, though you certainly have a data model likely to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_verrt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions and it'll be difficult to read).








Re: Re: A node down every day in a 6 nodes cluster

Posted by Jeff Jirsa <jj...@gmail.com>.
Only one node having the problem is suspicious. It may be that your
application is improperly pooling connections, or that you have a
hardware problem.

I don't see anything in nodetool that explains it, though you certainly
have a data model likely to cause problems over time (the cardinality of
rt_ac_stat.idx_rt_ac_stat_prot_verrt_ac_stat.idx_rt_ac_stat_prot_ver
is such that you have very wide partitions and it'll be difficult to
read).
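
To illustrate the cardinality point, here is a minimal Python sketch. It is
not taken from the thread. It assumes the index can be modeled as a hidden
table whose partition key is the indexed value, so every row sharing a value
lands in the same index partition. The column name prot_ver and the row
counts are hypothetical, chosen only to show the effect.

```python
from collections import Counter


def index_partition_sizes(rows, indexed_column):
    """Count how many base-table rows fall into each index partition.

    Assumption: one index partition per distinct indexed value, so the
    partition width equals the number of rows sharing that value.
    """
    return Counter(row[indexed_column] for row in rows)


# Hypothetical data: 100,000 rows but only 3 distinct values in the
# indexed column, i.e. a low-cardinality index.
rows = [{"id": i, "prot_ver": f"v{i % 3}"} for i in range(100_000)]

sizes = index_partition_sizes(rows, "prot_ver")
widest_value, widest_rows = sizes.most_common(1)[0]
print(len(sizes), "index partitions; the widest holds", widest_rows, "rows")
```

With so few distinct values, each index partition holds roughly a third of
the table, and it keeps growing as data is added. That is the kind of
"very wide partition" that becomes slow to read over time.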




On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:

> Hi Jeff,
>
> I need to restart the node manually every time. Only one node has this
> problem.
>
> I have attached the nodetool output. Thanks.
>
> Best Regards,
>
> 倪项菲 / David Ni
> 中移德电网络科技有限公司
> Virtue Intelligent Network Ltd, co.
> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
> Mob: +86 13797007811 | Tel: +86 27 5024 2516

Re: A node down every day in a 6 nodes cluster

Posted by Xiangfei Ni <xi...@cm-dt.com>.
Hi Jeff,
    I need to restart the node manually every time. Only one node has this problem.
    I have attached the nodetool output. Thanks.

Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516

From: Jeff Jirsa <jj...@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

That warning isn’t sufficient to understand why the node is going down


Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good idea

Are the nodes coming up on their own? Or are you restarting them?

Paste the output of nodetool tpstats and nodetool cfstats
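
When reading tpstats output, a quick first pass is to look for thread pools with pending or blocked tasks. The following Python sketch is not part of the thread. It assumes the usual tpstats table layout (pool name followed by Active, Pending, Completed, Blocked, and All time blocked columns), which may vary between versions. The SAMPLE text and all numbers in it are invented for illustration, and flag_busy_pools is a hypothetical helper name.

```python
def flag_busy_pools(tpstats_text):
    """Return {pool: (pending, blocked, all_time_blocked)} for pools with backlog."""
    flagged = {}
    for line in tpstats_text.splitlines():
        fields = line.split()
        # Pool rows are a name plus five integer columns:
        # Active, Pending, Completed, Blocked, All time blocked.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue  # skips the header, blank lines, and the dropped-message table
        name = fields[0]
        _active, pending, _completed, blocked, all_time_blocked = map(int, fields[1:])
        if pending > 0 or blocked > 0 or all_time_blocked > 0:
            flagged[name] = (pending, blocked, all_time_blocked)
    return flagged


# Invented sample in the assumed layout; not output from the cluster in this thread.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        1234567         0                 0
ReadStage                        32       870         987654         0                 0
Native-Transport-Requests       128        45        5555555         3              1200
CompactionExecutor                1         0           4321         0                 0

Message type           Dropped
READ                      5120
MUTATION                     0
"""

for pool, (pending, blocked, all_time) in flag_busy_pools(SAMPLE).items():
    print(f"{pool}: pending={pending} blocked={blocked} all_time_blocked={all_time}")
```

Pools that stay flagged across repeated runs are worth a closer look. A single snapshot with a few pending tasks is not necessarily a problem.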



--
Jeff Jirsa


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
Hi Cassandra experts,
  I am facing an issue: one node goes down every day in a 6-node cluster. The cluster is in a single DC.
  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Every node holds about 200 GB of data, and the RF for the business CF is 3. A node goes down once every day, and system.log shows the following:
WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
        ... 26 common frames omitted
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]
        ... 32 common frames omitted
WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

I have confirmed that nev_tsp_sa has all permissions on the nev_prod_tsp keyspace:
cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
 nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

The cache disk can be read and written normally.

Any help would be highly appreciated. Thanks very much!


Best Regards,

倪项菲/ David Ni
中移德电网络科技有限公司
Virtue Intelligent Network Ltd, co.
Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
Mob: +86 13797007811|Tel: + 86 27 5024 2516


Re: A node down every day in a 6 nodes cluster

Posted by Jeff Jirsa <jj...@gmail.com>.
That warning isn’t sufficient to understand why the node is going down.

Cassandra 3.9 has some pretty serious known issues; upgrading to 3.11.3 is likely a good idea.

Are the nodes coming up on their own, or are you restarting them?

Please paste the output of nodetool tpstats and nodetool cfstats.
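
For anyone following along, here is a rough sketch of one way to capture the two nodetool outputs requested above into files that can be pasted into a reply. It assumes nodetool is on the PATH and can reach the local node with default JMX settings. The helper name collect_outputs, the NODETOOL_COMMANDS table, and the output directory are made up for illustration and are not part of Cassandra.

```python
import subprocess
from pathlib import Path


def collect_outputs(commands, out_dir):
    """Run each command and save its combined stdout/stderr to <out_dir>/<name>.txt.

    `commands` maps a short name to an argv list. Returns {name: Path}.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = {}
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep error text next to normal output
                universal_newlines=True,
                timeout=120,
            )
            text = result.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            # For example: nodetool is not on PATH, or the node did not answer in time.
            text = "failed to run {}: {}\n".format(" ".join(argv), exc)
        target = out_path / (name + ".txt")
        target.write_text(text)
        saved[name] = target
    return saved


# The two commands asked for in this thread (assumption: nodetool is on PATH).
NODETOOL_COMMANDS = {
    "tpstats": ["nodetool", "tpstats"],
    "cfstats": ["nodetool", "cfstats"],
}
```

Calling collect_outputs(NODETOOL_COMMANDS, "cassandra-diag") on the affected node, ideally shortly before or after it goes down, would leave tpstats.txt and cfstats.txt in that directory. A failure to run a command is written into the corresponding file instead of raising, so a partial capture is still usable.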




-- 
Jeff Jirsa


> On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xi...@cm-dt.com> wrote:
> 
> Hi Cassandra experts,
>   I am facing an issue: one node goes down every day in a 6-node cluster. The cluster is in a single DC.
>   Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m. Every node holds about 200 GB of data, and the RF for the business CF is 3. A node goes down about once a day, and system.log shows the following:
> WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
> ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
> com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>         at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[guava-18.0.jar:na]
>         at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) [apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) [apache-cassandra-3.9.jar:3.9]
>         at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.39.Final.jar:4.0.39.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) [netty-all-4.0.39.Final.jar:4.0.39.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.39.Final.jar:4.0.39.Final]
>         at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) [netty-all-4.0.39.Final.jar:4.0.39.Final]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
>         at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>         at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]
>         at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[guava-18.0.jar:na]
>         at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]
>         ... 26 common frames omitted
> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>         at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227) ~[apache-cassandra-3.9.jar:3.9]
>         at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93) ~[apache-cassandra-3.9.jar:3.9]
>         ... 32 common frames omitted
> WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
> ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected error during query
> com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>         at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
>  
> I have confirmed that nev_tsp_sa has all permissions on the nev_prod_tsp keyspace:
> cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';
>
>  role       | resource          | permissions
> ------------+-------------------+--------------------------------------------------------------
>  nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}
>
> The cache disk can be read and written normally.
>
> Any help would be highly appreciated. Thanks very much!
>  
>  
> Best Regards,
>  
> 倪项菲/ David Ni
> 中移德电网络科技有限公司
> Virtue Intelligent Network Ltd, co.
> 
> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
> Mob: +86 13797007811|Tel: + 86 27 5024 2516
>