You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Pavel Polushkin <pp...@enkata.com> on 2012/05/11 16:38:20 UTC

Cassandra stucks

Hello,

 

We faced with a strange problem while testing performance on Cassandra
cluster. After some time all nodes went to down state for several days.
Now all nodes went back to up state and only one node still down.

 

Nodetool on down node throws exception:

Error connection to remote JMX agent!

java.io.IOException: Failed to retrieve RMIServer stub:
javax.naming.CommunicationException [Root exception is
java.rmi.ConnectIOException: error during JRMP connection establishment;
nested exception is:

        java.net.SocketTimeoutException: Read timed out]

        at
javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)

        at
javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.
java:248)

        at
org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)

        at
org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)

        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)

Caused by: javax.naming.CommunicationException [Root exception is
java.rmi.ConnectIOException: error during JRMP connection establishment;
nested exception is:

        java.net.SocketTimeoutException: Read timed out]

        at
com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:10
1)

        at
com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java
:185)

        at javax.naming.InitialContext.lookup(InitialContext.java:392)

        at
javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.
java:1888)

        at
javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java
:1858)

        at
javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)

        ... 4 more

Caused by: java.rmi.ConnectIOException: error during JRMP connection
establishment; nested exception is:

        java.net.SocketTimeoutException: Read timed out

        at
sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)

        at
sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)

        at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)

        at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)

        at
com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97
)

        ... 9 more

Caused by: java.net.SocketTimeoutException: Read timed out

        at java.net.SocketInputStream.socketRead0(Native Method)

        at java.net.SocketInputStream.read(SocketInputStream.java:129)

        at
java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

        at
java.io.BufferedInputStream.read(BufferedInputStream.java:237)

        at java.io.DataInputStream.readByte(DataInputStream.java:248)

        at
sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)

        ... 13 more

 

In system log of down node unlimited list of such errors:

INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804)
InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now
UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804)
InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now
UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.161 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.165 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.162 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.163 is now dead.

INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804)
InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now
UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804)
InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now
UP

 

The suspicious fact is that on this node we have several tcp connections
to other nodes 7000 port in CLOSE_WAIT state:

Active Internet connections (servers and established)

Proto Recv-Q Send-Q Local Address           Foreign Address
State

tcp   869073      0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274
CLOSE_WAIT

tcp   463429      0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654
CLOSE_WAIT

tcp   873838      0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486
CLOSE_WAIT

tcp   860245      0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028
CLOSE_WAIT

tcp      112      0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321
CLOSE_WAIT

tcp     2124      0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338
CLOSE_WAIT

tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408
ESTABLISHED

tcp      184      0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862
CLOSE_WAIT

tcp   534489      0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331
ESTABLISHED

tcp      886      0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034
CLOSE_WAIT

tcp        0      0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver
ESTABLISHED

tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348
ESTABLISHED

tcp      187      0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538
CLOSE_WAIT

tcp      253      0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359
CLOSE_WAIT

 

Also I have attached thread dump

 

Thanks,

Pavel

 

 


Re: Cassandra stucks

Posted by aaron morton <aa...@thelastpickle.com>.
If you are on DSE I would try the DS forums first http://www.datastax.com/support-forums/

Cheers
 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 15/05/2012, at 1:11 AM, Pavel Polushkin wrote:

> Thank you for your reply Aaron!
>  
> Here is more details about our environment:
> 1.      OS – Debian 6.0.4
> root@rcwocas05:~# uname -a
> Linux rcwocas05 2.6.32-5-amd64 #1 SMP Thu Mar 22 17:26:33 UTC 2012 x86_64 GNU/Linux
> 2.      Java
> root@rcwocas05:~# java -version
> java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
> 3.      Cassandra
> Datastax Enterprise – dse 2.0.1, built on Cassandra 1.0.8
>  
> Now I will try to answer your questions:
> Q: Where are you running nodetool from ?
> A: From the same node
>  
> Q: Do you have any switches or firewalls between the servers ? Could they be closing connections?
> A: All nodes are on the same switch without firewalls. We are using nodetool –h localhost ring, also cluster configured by IP addresses, so it’s seems that DNS should not be the cause of this.
>  
> We didn’t find network problems on cluster. Here is ifconfig that shows that there is no error or dropped packets:
> root@rcwocas05:~# ifconfig
> eth1      Link encap:Ethernet  HWaddr 00:25:90:63:16:c8
>           inet addr:172.15.2.165  Bcast:172.15.2.255  Mask:255.255.255.0
>           inet6 addr: fe80::225:90ff:fe63:16c8/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
>           RX packets:573854310 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:777260637 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:5649087010357 (5.1 TiB)  TX bytes:4024634337167 (3.6 TiB)
>  
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:926558 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:926558 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:220573098 (210.3 MiB)  TX bytes:220573098 (210.3 MiB)
>  
> Seems that network working fine, we tested ping then several nodes were in down state and didn’t find any connection errors.
>  
> Q: Anything interesting with your networking setup
> A: We using standard Cassandra configuration. The only think we configured is IP’s of cluster nodes.
>  
>  
> I would like to give more details:
>  
> 1.      Attaching Cassandra node thread dump when it in down state. You can find that all “network” threads are in WAITING state. Here is example of such threads:
> "Thrift:730" daemon prio=10 tid=0x00007f15bc050800 nid=0x30c5 waiting on condition [0x00007f934e75b000]
>    java.lang.Thread.State: WAITING (parking)
>                 at sun.misc.Unsafe.park(Native Method)
>                 - parking to wait for  <0x00000006810ec428> (a java.util.concurrent.SynchronousQueue$TransferStack)
>                 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>                 at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:422)
>                 at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
>                 at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:857)
>                 at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
>                 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
>                 at java.lang.Thread.run(Thread.java:662)
> "WRITE-/172.15.2.162" daemon prio=10 tid=0x0000000000fc8800 nid=0x1bb4 waiting on condition [0x00007f934d082000]
>    java.lang.Thread.State: WAITING (parking)
>                 at sun.misc.Unsafe.park(Native Method)
>                 - parking to wait for  <0x0000000681115318> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>                 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>                 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>                 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>                 at org.apache.cassandra.net.OutboundTcpConnection.take(OutboundTcpConnection.java:198)
>                 at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:90)
> …
>  
> Here is logs from two nodes: Node1 and node 5 when Node5 goes to down state: Node 5 contains exceptions while Node 1 only diagnose that Node 5 went to down state.
>  
> NODE 5 LOG:
> ---
> DEBUG [Thrift:873] 2012-05-14 05:44:03,066 StorageProxy.java (line212) Write timeout java.util.concurrent.TimeoutException for one (or more) of:
> DEBUG [Thrift:873] 2012-05-14 05:44:03,066 CassandraServer.java (line648) ... timed out
> DEBUG [Thrift:873] 2012-05-14 05:44:03,066 CustomTThreadPoolServer.java (line 260) Thrift transport error occurred during processing of message.
> org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is
> often indicative of an internal error on the server side. Please check your server logs.)
>        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
>        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
>        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
>        at org.apache.cassandra.thrift.Dse$Processor.process(Dse.java:908)
>        at com.datastax.bdp.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:250)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> DEBUG [Thread-3010] 2012-05-14 05:44:12,007 Gossiper.java (line 222) Reseting version for /172.15.2.164
>  INFO [GossipTasks:1] 2012-05-14 05:44:12,135 Gossiper.java (line 818) InetAddress /172.15.2.161 is now dead.
> DEBUG [GossipTasks:1] 2012-05-14 05:44:12,135 MessagingService.java (line 221) Resetting pool for /172.15.2.161
>  INFO [GossipTasks:1] 2012-05-14 05:44:12,135 Gossiper.java (line 818) InetAddress /172.15.2.162 is now dead.
> DEBUG [GossipTasks:1] 2012-05-14 05:44:12,135 MessagingService.java (line 221) Resetting pool for /172.15.2.162
>  INFO [GossipTasks:1] 2012-05-14 05:44:12,136 Gossiper.java (line 818) InetAddress /172.15.2.163 is now dead.
> DEBUG [GossipTasks:1] 2012-05-14 05:44:12,136 MessagingService.java (line 221) Resetting pool for /172.15.2.163
> DEBUG [Thread-3008] 2012-05-14 05:44:12,625 Gossiper.java (line 222) Reseting version for /172.15.2.163
> DEBUG [Thread-3007] 2012-05-14 05:44:12,626 Gossiper.java (line 222) Reseting version for /172.15.2.161
> DEBUG [Thread-3009] 2012-05-14 05:44:12,682 Gossiper.java (line 222) Reseting version for /172.15.2.162
> DEBUG [WRITE-/172.15.2.163] 2012-05-14 05:44:13,136 OutboundTcpConnection.java (line 211) attempting to connect to
> /172.15.2.163 DEBUG [WRITE-/172.15.2.161] 2012-05-14 05:44:13,136 OutboundTcpConnection.java (line 211) attempting to connect to
> /172.15.2.161  INFO [GossipTasks:1] 2012-05-14 05:44:14,136 Gossiper.java (line 818) InetAddress /172.15.2.164 is now dead.
> DEBUG [GossipTasks:1] 2012-05-14 05:44:14,137 MessagingService.java (line 221) Resetting pool for /172.15.2.164
> DEBUG [WRITE-/172.15.2.164] 2012-05-14 05:44:15,137 OutboundTcpConnection.java (line 211) attempting to connect to
> /172.15.2.164 DEBUG [WRITE-/172.15.2.162] 2012-05-14 05:44:17,138 OutboundTcpConnection.java (line 211) attempting to connect to
> 
> --------
> NODE 1 LOG:
> -------
> DEBUG [MutationStage:7771] 2012-05-14 05:44:03,068 Table.java (line 387) applying mutation of row
> 32303132303231305f677569645f66697273745f31303634323231 DEBUG [Thrift:317] 2012-05-14 05:44:03,068 StorageProxy.java (line
> 212) Write timeout java.util.concurrent.TimeoutException for one (or more) of: DEBUG [Thrift:317] 2012-05-14 05:44:03,068 CassandraServer.java (line 648) ... timed out
> DEBUG [Thrift:317] 2012-05-14 05:44:03,069 CustomTThreadPoolServer.java (line 260) Thrift transport error occurred during processing of message.
> org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is often indicative of an internal error on the server side. Please check your server logs.)
>        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
>        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
>        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
>        at org.apache.cassandra.thrift.Dse$Processor.process(Dse.java:908)
>        at com.datastax.bdp.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:250)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> DEBUG [Thrift:317] 2012-05-14 05:44:03,070 ClientState.java (line 104) logged out: #<User allow_all groups=[]>
> DEBUG [RequestResponseStage:7161] 2012-05-14 05:44:03,073 ResponseVerbHandler.java (line 44) Processing response on a callback from 3318177@/172.15.2.164 DEBUG [MutationStage:7772] 2012-05-14 05:44:03,079
> RowMutationVerbHandler.java (line 44) Applying RowMutation(keyspace='Performance90',
> key='32303132303231305f677569645f66697273745f3130363 4323230', modifications=[ColumnFamily(VideoMedia20120210 [file:false:31@1336999691332000,media:false:1048576@1336999691332001,])]) 
> DEBUG [MutationStage:7772] 2012-05-14 05:44:03,079 Table.java (line 387) applying mutation of row 32303132303231305f677569645f66697273745f31303634323230
> DEBUG [MutationStage:7772] 2012-05-14 05:44:03,081 RowMutationVerbHandler.java (line 56) RowMutation(keyspace='Performance90',key='32303132303231305f677569645f66697273745f31303634323230', modifications=[ColumnFamily(VideoMedia20120210[file:false:31@1336999691332000,media:false:1048576@1336999691332001,])])
> applied.  Sending response to 3167090@/172.15.2.162
> DEBUG [MutationStage:7773] 2012-05-14 05:44:03,084 RowMutationVerbHandler.java (line 44) Applying RowMutation(keyspace='Performance90', key='32303132303231305f677569645f66697273745f3130363 4323136', modifications=[ColumnFamily(VideoMedia20120210 [file:false:31@1336999691335000,media:false:1048576@1336999691335001,])])
> DEBUG [MutationStage:7773] 2012-05-14 05:44:03,084 Table.java (line 387) applying mutation of row 32303132303231305f677569645f66697273745f31303634323136
> DEBUG [MutationStage:7773] 2012-05-14 05:44:03,085 RowMutationVerbHandler.java (line 56) RowMutation(keyspace='Performance90', key='32303132303231305f677569645f66697273745f31303634323136',modifications=[ColumnFamily(VideoMedia20120210
> [file:false:31@1336999691335000,media:false:1048576@1336999691335001,])])
> applied.  Sending response to 3074618@/172.15.2.164
> DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 StorageProxy.java (line 357) Adding hint for /172.15.2.165
> DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 Table.java (line 387) applying mutation of row 66666666666666666666666666666664
> DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 SlabAllocator.java (line 105) 1 regions now allocated in
> org.apache.cassandra.utils.SlabAllocator@7dce40cc
> DEBUG [MutationStage:7775] 2012-05-14 05:44:07,027 StorageProxy.java (line 357) Adding hint for /172.15.2.165
> DEBUG [MutationStage:7775] 2012-05-14 05:44:07,027 Table.java (line 387) applying mutation of row 66666666666666666666666666666664
> DEBUG [Thread-1959] 2012-05-14 05:44:12,135 Gossiper.java (line 222) Reseting version for /172.15.2.165
> DEBUG [Thread-1960] 2012-05-14 05:44:12,135 Gossiper.java (line 222) Reseting version for /172.15.2.165
>  INFO [GossipTasks:1] 2012-05-14 05:44:12,626 Gossiper.java (line 818) InetAddress /172.15.2.165 is now dead.
> DEBUG [GossipTasks:1] 2012-05-14 05:44:12,626 MessagingService.java
> (line 221) Resetting pool for /172.15.2.165
>  
> Pavel Polushkin.
> <thread dump.txt>


RE: Cassandra stucks

Posted by Pavel Polushkin <pp...@enkata.com>.
Thank you for your reply Aaron!

 

Here is more details about our environment:

1.      OS - Debian 6.0.4

root@rcwocas05:~# uname -a

Linux rcwocas05 2.6.32-5-amd64 #1 SMP Thu Mar 22 17:26:33 UTC 2012
x86_64 GNU/Linux

2.      Java

root@rcwocas05:~# java -version

java version "1.6.0_26"

Java(TM) SE Runtime Environment (build 1.6.0_26-b03)

Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

3.      Cassandra 

Datastax Enterprise - dse 2.0.1, built on Cassandra 1.0.8

 

Now I will try to answer your questions:

Q: Where are you running nodetool from ?

A: From the same node

 

Q: Do you have any switches or firewalls between the servers ? Could
they be closing connections?

A: All nodes are on the same switch without firewalls. We are using
nodetool -h localhost ring, also cluster configured by IP addresses, so
it's seems that DNS should not be the cause of this.

 

We didn't find network problems on cluster. Here is ifconfig that shows
that there is no error or dropped packets: 

root@rcwocas05:~# ifconfig

eth1      Link encap:Ethernet  HWaddr 00:25:90:63:16:c8

          inet addr:172.15.2.165  Bcast:172.15.2.255  Mask:255.255.255.0

          inet6 addr: fe80::225:90ff:fe63:16c8/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1

          RX packets:573854310 errors:0 dropped:0 overruns:0 frame:0

          TX packets:777260637 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:5649087010357 (5.1 TiB)  TX bytes:4024634337167 (3.6
TiB)

 

lo        Link encap:Local Loopback

          inet addr:127.0.0.1  Mask:255.0.0.0

          inet6 addr: ::1/128 Scope:Host

          UP LOOPBACK RUNNING  MTU:16436  Metric:1

          RX packets:926558 errors:0 dropped:0 overruns:0 frame:0

          TX packets:926558 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:220573098 (210.3 MiB)  TX bytes:220573098 (210.3 MiB)

 

Seems that network working fine, we tested ping then several nodes were
in down state and didn't find any connection errors.

 

Q: Anything interesting with your networking setup

A: We using standard Cassandra configuration. The only think we
configured is IP's of cluster nodes.

 

 

I would like to give more details:

 

1.      Attaching Cassandra node thread dump when it in down state. You
can find that all "network" threads are in WAITING state. Here is
example of such threads:

"Thrift:730" daemon prio=10 tid=0x00007f15bc050800 nid=0x30c5 waiting on
condition [0x00007f934e75b000]

   java.lang.Thread.State: WAITING (parking)

                at sun.misc.Unsafe.park(Native Method)

                - parking to wait for  <0x00000006810ec428> (a
java.util.concurrent.SynchronousQueue$TransferStack)

                at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

                at
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(Synchro
nousQueue.java:422)

                at
java.util.concurrent.SynchronousQueue$TransferStack.transfer(Synchronous
Queue.java:323)

                at
java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:857)

                at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:
947)

                at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja
va:907)

                at java.lang.Thread.run(Thread.java:662)

"WRITE-/172.15.2.162" daemon prio=10 tid=0x0000000000fc8800 nid=0x1bb4
waiting on condition [0x00007f934d082000]

   java.lang.Thread.State: WAITING (parking)

                at sun.misc.Unsafe.park(Native Method)

                - parking to wait for  <0x0000000681115318> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

                at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

                at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.aw
ait(AbstractQueuedSynchronizer.java:1987)

                at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:3
99)

                at
org.apache.cassandra.net.OutboundTcpConnection.take(OutboundTcpConnectio
n.java:198)

                at
org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection
.java:90)

...

 

Here is logs from two nodes: Node1 and node 5 when Node5 goes to down
state: Node 5 contains exceptions while Node 1 only diagnose that Node 5
went to down state.

 

NODE 5 LOG:
---
DEBUG [Thrift:873] 2012-05-14 05:44:03,066 StorageProxy.java (line212)
Write timeout java.util.concurrent.TimeoutException for one (or more)
of:
DEBUG [Thrift:873] 2012-05-14 05:44:03,066 CassandraServer.java
(line648) ... timed out
DEBUG [Thrift:873] 2012-05-14 05:44:03,066 CustomTThreadPoolServer.java
(line 260) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException: Cannot read. Remote
side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is
often indicative of an internal error on the server side. Please check
your server logs.)
       at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
       at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:
378)
       at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:
297)
       at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProto
col.java:204)
       at
org.apache.cassandra.thrift.Dse$Processor.process(Dse.java:908)
       at
com.datastax.bdp.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.
run(CustomTThreadPoolServer.java:250)
       at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecuto
r.java:886)
       at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja
va:908)
       at java.lang.Thread.run(Thread.java:662)
DEBUG [Thread-3010] 2012-05-14 05:44:12,007 Gossiper.java (line 222)
Reseting version for /172.15.2.164 <http://172.15.2.164/> 
 INFO [GossipTasks:1] 2012-05-14 05:44:12,135 Gossiper.java (line 818)
InetAddress /172.15.2.161 <http://172.15.2.161/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:12,135 MessagingService.java
(line 221) Resetting pool for /172.15.2.161 <http://172.15.2.161/> 
 INFO [GossipTasks:1] 2012-05-14 05:44:12,135 Gossiper.java (line 818)
InetAddress /172.15.2.162 <http://172.15.2.162/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:12,135 MessagingService.java
(line 221) Resetting pool for /172.15.2.162 <http://172.15.2.162/> 
 INFO [GossipTasks:1] 2012-05-14 05:44:12,136 Gossiper.java (line 818)
InetAddress /172.15.2.163 <http://172.15.2.163/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:12,136 MessagingService.java
(line 221) Resetting pool for /172.15.2.163 <http://172.15.2.163/> 
DEBUG [Thread-3008] 2012-05-14 05:44:12,625 Gossiper.java (line 222)
Reseting version for /172.15.2.163 <http://172.15.2.163/> 
DEBUG [Thread-3007] 2012-05-14 05:44:12,626 Gossiper.java (line 222)
Reseting version for /172.15.2.161 <http://172.15.2.161/> 
DEBUG [Thread-3009] 2012-05-14 05:44:12,682 Gossiper.java (line 222)
Reseting version for /172.15.2.162 <http://172.15.2.162/> 
DEBUG [WRITE-/172.15.2.163 <http://172.15.2.163/> ] 2012-05-14
05:44:13,136 OutboundTcpConnection.java (line 211) attempting to connect
to
/172.15.2.163 <http://172.15.2.163/>  DEBUG [WRITE-/172.15.2.161
<http://172.15.2.161/> ] 2012-05-14 05:44:13,136
OutboundTcpConnection.java (line 211) attempting to connect to
/172.15.2.161 <http://172.15.2.161/>   INFO [GossipTasks:1] 2012-05-14
05:44:14,136 Gossiper.java (line 818) InetAddress /172.15.2.164
<http://172.15.2.164/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:14,137 MessagingService.java
(line 221) Resetting pool for /172.15.2.164 <http://172.15.2.164/> 
DEBUG [WRITE-/172.15.2.164 <http://172.15.2.164/> ] 2012-05-14
05:44:15,137 OutboundTcpConnection.java (line 211) attempting to connect
to
/172.15.2.164 <http://172.15.2.164/>  DEBUG [WRITE-/172.15.2.162
<http://172.15.2.162/> ] 2012-05-14 05:44:17,138
OutboundTcpConnection.java (line 211) attempting to connect to

--------
NODE 1 LOG:
-------
DEBUG [MutationStage:7771] 2012-05-14 05:44:03,068 Table.java (line 387)
applying mutation of row
32303132303231305f677569645f66697273745f31303634323231 DEBUG
[Thrift:317] 2012-05-14 05:44:03,068 StorageProxy.java (line
212) Write timeout java.util.concurrent.TimeoutException for one (or
more) of: DEBUG [Thrift:317] 2012-05-14 05:44:03,068
CassandraServer.java (line 648) ... timed out
DEBUG [Thrift:317] 2012-05-14 05:44:03,069 CustomTThreadPoolServer.java
(line 260) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException: Cannot read. Remote
side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is
often indicative of an internal error on the server side. Please check
your server logs.)
       at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
       at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:
378)
       at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:
297)
       at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProto
col.java:204)
       at
org.apache.cassandra.thrift.Dse$Processor.process(Dse.java:908)
       at
com.datastax.bdp.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.
run(CustomTThreadPoolServer.java:250)
       at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecuto
r.java:886)
       at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja
va:908)
       at java.lang.Thread.run(Thread.java:662)
DEBUG [Thrift:317] 2012-05-14 05:44:03,070 ClientState.java (line 104)
logged out: #<User allow_all groups=[]>
DEBUG [RequestResponseStage:7161] 2012-05-14 05:44:03,073
ResponseVerbHandler.java (line 44) Processing response on a callback
from 3318177@/172.15.2.164 <http://172.15.2.164/>  DEBUG
[MutationStage:7772] 2012-05-14 05:44:03,079
RowMutationVerbHandler.java (line 44) Applying
RowMutation(keyspace='Performance90',
key='32303132303231305f677569645f66697273745f3130363 4323230',
modifications=[ColumnFamily(VideoMedia20120210
[file:false:31@1336999691332000,media:false:1048576@1336999691332001,])]
) 
DEBUG [MutationStage:7772] 2012-05-14 05:44:03,079 Table.java (line 387)
applying mutation of row
32303132303231305f677569645f66697273745f31303634323230
DEBUG [MutationStage:7772] 2012-05-14 05:44:03,081
RowMutationVerbHandler.java (line 56)
RowMutation(keyspace='Performance90',key='32303132303231305f677569645f66
697273745f31303634323230',
modifications=[ColumnFamily(VideoMedia20120210[file:false:31@13369996913
32000,media:false:1048576@1336999691332001,])])
applied.  Sending response to 3167090@/172.15.2.162
<http://172.15.2.162/> 
DEBUG [MutationStage:7773] 2012-05-14 05:44:03,084
RowMutationVerbHandler.java (line 44) Applying
RowMutation(keyspace='Performance90',
key='32303132303231305f677569645f66697273745f3130363 4323136',
modifications=[ColumnFamily(VideoMedia20120210
[file:false:31@1336999691335000,media:false:1048576@1336999691335001,])]
)
DEBUG [MutationStage:7773] 2012-05-14 05:44:03,084 Table.java (line 387)
applying mutation of row
32303132303231305f677569645f66697273745f31303634323136
DEBUG [MutationStage:7773] 2012-05-14 05:44:03,085
RowMutationVerbHandler.java (line 56)
RowMutation(keyspace='Performance90',
key='32303132303231305f677569645f66697273745f31303634323136',modificatio
ns=[ColumnFamily(VideoMedia20120210
[file:false:31@1336999691335000,media:false:1048576@1336999691335001,])]
)
applied.  Sending response to 3074618@/172.15.2.164
<http://172.15.2.164/> 
DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 StorageProxy.java
(line 357) Adding hint for /172.15.2.165 <http://172.15.2.165/> 
DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 Table.java (line 387)
applying mutation of row 66666666666666666666666666666664 

DEBUG [MutationStage:7774] 2012-05-14 05:44:07,026 SlabAllocator.java
(line 105) 1 regions now allocated in
org.apache.cassandra.utils.SlabAllocator@7dce40cc
DEBUG [MutationStage:7775] 2012-05-14 05:44:07,027 StorageProxy.java
(line 357) Adding hint for /172.15.2.165 <http://172.15.2.165/> 
DEBUG [MutationStage:7775] 2012-05-14 05:44:07,027 Table.java (line 387)
applying mutation of row 66666666666666666666666666666664 

DEBUG [Thread-1959] 2012-05-14 05:44:12,135 Gossiper.java (line 222)
Reseting version for /172.15.2.165 <http://172.15.2.165/> 
DEBUG [Thread-1960] 2012-05-14 05:44:12,135 Gossiper.java (line 222)
Reseting version for /172.15.2.165 <http://172.15.2.165/> 
 INFO [GossipTasks:1] 2012-05-14 05:44:12,626 Gossiper.java (line 818)
InetAddress /172.15.2.165 <http://172.15.2.165/>  is now dead.
DEBUG [GossipTasks:1] 2012-05-14 05:44:12,626 MessagingService.java
(line 221) Resetting pool for /172.15.2.165 <http://172.15.2.165/> 

 

Pavel Polushkin.


Re: Cassandra stucks

Posted by aaron morton <aa...@thelastpickle.com>.
We've not had any reported issues with connection handing, I would look for other possible reasons first. Out of interest though what OS are you using? and what is the exact JVM version ? 

The javax.naming.CommunicationException (http://docs.oracle.com/javase/6/docs/api/javax/naming/CommunicationException.html) is raised when code cannot talk to the DNS server. Where are you running nodetool from ?

Do you have any switches or firewalls between the servers ? Could they be closing connections ?

Can you match the logs between machines. e.g. what was 172.15.2.163 doing when the node below thought it was dead. 

What cassandra configuration changes have you made to the nodes and is there anything interesting with your networking setup ?

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 12/05/2012, at 4:11 AM, Pavel Polushkin wrote:

> Hello,
> Actually there is no problems with JMX, it works fine when node are in UP state. But after a while cluster goes to inadequate state. For now it seems that it’s a bug of connection handling in Cassandra.
> Pavel.
>  
> From: Madalina Matei [mailto:madalinaimatei@gmail.com] 
> Sent: Friday, May 11, 2012 20:03
> To: user@cassandra.apache.org
> Subject: Re: Cassandra stucks
>  
> Check your JMX port in cassandra-env.sh and see if that's open. 
>  
> Also if you have enabled 
>  
>  JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public name>"
>  
> and you are using an ip address for -Djava.rmi.server.hostname make sure that is the correct ip.
>  
>  
> On 11 May 2012, at 16:42, Pavel Polushkin wrote:
> 
> 
> No We are using dedicated phisical hardware. Currently we have 5 nodes.
>  
> From: Madalina Matei [mailto:madalinaimatei@gmail.com] 
> Sent: Friday, May 11, 2012 19:40
> To: user@cassandra.apache.org
> Subject: Re: Cassandra stucks
>  
> Are you using EC2 ?
>  
> On 11 May 2012, at 16:13, Pavel Polushkin wrote:
> 
> 
> 
> We use 1.0.8 version.
>  
> From: David Leimbach [mailto:leimy2k@gmail.com] 
> Sent: Friday, May 11, 2012 18:48
> To: user@cassandra.apache.org
> Subject: Re: Cassandra stucks
>  
> What's the version number of Cassandra?
> 
> On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin <pp...@enkata.com> wrote:
> Hello,
> 
>  
> 
> We faced with a strange problem while testing performance on Cassandra cluster. After some time all nodes went to down state for several days. Now all nodes went back to up state and only one node still down.
> 
>  
> 
> Nodetool on down node throws exception:
> 
> Error connection to remote JMX agent!
> 
> java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out]
> 
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)
> 
>         at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
> 
>         at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)
> 
>         at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)
> 
>         at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)
> 
> Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out]
> 
>         at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
> 
>         at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:185)
> 
>         at javax.naming.InitialContext.lookup(InitialContext.java:392)
> 
>         at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1888)
> 
>         at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1858)
> 
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)
> 
>         ... 4 more
> 
> Caused by: java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out
> 
>         at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)
> 
>         at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
> 
>         at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)
> 
>         at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
> 
>         at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97)
> 
>         ... 9 more
> 
> Caused by: java.net.SocketTimeoutException: Read timed out
> 
>         at java.net.SocketInputStream.socketRead0(Native Method)
> 
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
> 
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> 
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> 
>         at java.io.DataInputStream.readByte(DataInputStream.java:248)
> 
>         at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)
> 
>         ... 13 more
> 
>  
> 
> In system log of down node unlimited list of such errors:
> 
> INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.161 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.165 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.162 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.163 is now dead.
> 
> INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP
> 
>  
> 
> The suspicious fact is that on this node we have several tcp connections to other nodes 7000 port in CLOSE_WAIT state:
> 
> Active Internet connections (servers and established)
> 
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> 
> tcp   869073      0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274 CLOSE_WAIT
> 
> tcp   463429      0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654 CLOSE_WAIT
> 
> tcp   873838      0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486 CLOSE_WAIT
> 
> tcp   860245      0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028 CLOSE_WAIT
> 
> tcp      112      0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321 CLOSE_WAIT
> 
> tcp     2124      0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338 CLOSE_WAIT
> 
> tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408 ESTABLISHED
> 
> tcp      184      0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862 CLOSE_WAIT
> 
> tcp   534489      0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331 ESTABLISHED
> 
> tcp      886      0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034 CLOSE_WAIT
> 
> tcp        0      0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver ESTABLISHED
> 
> tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348 ESTABLISHED
> 
> tcp      187      0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538 CLOSE_WAIT
> 
> tcp      253      0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359 CLOSE_WAIT
> 
>  
> 
> Also I have attached thread dump
> 
>  
> 
> Thanks,
> 
> Pavel
> 
>  
>  
>  


RE: Cassandra stucks

Posted by Pavel Polushkin <pp...@enkata.com>.
Hello, 

Actually there is no problems with JMX, it works fine when node are in
UP state. But after a while cluster goes to inadequate state. For now it
seems that it's a bug of connection handling in Cassandra. 

Pavel.

 

From: Madalina Matei [mailto:madalinaimatei@gmail.com] 
Sent: Friday, May 11, 2012 20:03
To: user@cassandra.apache.org
Subject: Re: Cassandra stucks

 

Check your JMX port in cassandra-env.sh and see if that's open. 

 

Also if you have enabled 

 

 JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public name>"

 

and you are using an ip address for -Djava.rmi.server.hostname make sure
that is the correct ip.

 

 

On 11 May 2012, at 16:42, Pavel Polushkin wrote:





No We are using dedicated phisical hardware. Currently we have 5 nodes.

 

From: Madalina Matei [mailto:madalinaimatei@gmail.com] 
Sent: Friday, May 11, 2012 19:40
To: user@cassandra.apache.org
Subject: Re: Cassandra stucks

 

Are you using EC2 ?

 

On 11 May 2012, at 16:13, Pavel Polushkin wrote:






We use 1.0.8 version.

 

From: David Leimbach [mailto:leimy2k@gmail.com] 
Sent: Friday, May 11, 2012 18:48
To: user@cassandra.apache.org
Subject: Re: Cassandra stucks

 

What's the version number of Cassandra?

On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin <pp...@enkata.com>
wrote:

Hello,

 

We faced with a strange problem while testing performance on Cassandra
cluster. After some time all nodes went to down state for several days.
Now all nodes went back to up state and only one node still down.

 

Nodetool on down node throws exception:

Error connection to remote JMX agent!

java.io.IOException: Failed to retrieve RMIServer stub:
javax.naming.CommunicationException [Root exception is
java.rmi.ConnectIOException: error during JRMP connection establishment;
nested exception is:

        java.net.SocketTimeoutException: Read timed out]

        at
javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)

        at
javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.
java:248)

        at
org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)

        at
org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)

        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)

Caused by: javax.naming.CommunicationException [Root exception is
java.rmi.ConnectIOException: error during JRMP connection establishment;
nested exception is:

        java.net.SocketTimeoutException: Read timed out]

        at
com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:10
1)

        at
com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java
:185)

        at javax.naming.InitialContext.lookup(InitialContext.java:392)

        at
javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.
java:1888)

        at
javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java
:1858)

        at
javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)

        ... 4 more

Caused by: java.rmi.ConnectIOException: error during JRMP connection
establishment; nested exception is:

        java.net.SocketTimeoutException: Read timed out

        at
sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)

        at
sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)

        at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)

        at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)

        at
com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97
)

        ... 9 more

Caused by: java.net.SocketTimeoutException: Read timed out

        at java.net.SocketInputStream.socketRead0(Native Method)

        at java.net.SocketInputStream.read(SocketInputStream.java:129)

        at
java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

        at
java.io.BufferedInputStream.read(BufferedInputStream.java:237)

        at java.io.DataInputStream.readByte(DataInputStream.java:248)

        at
sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)

        ... 13 more

 

In system log of down node unlimited list of such errors:

INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804)
InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now
UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804)
InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now
UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.161 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.165 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.162 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.163 is now dead.

INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804)
InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now
UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804)
InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now
UP

 

The suspicious fact is that on this node we have several tcp connections
to other nodes 7000 port in CLOSE_WAIT state:

Active Internet connections (servers and established)

Proto Recv-Q Send-Q Local Address           Foreign Address
State

tcp   869073      0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274
CLOSE_WAIT

tcp   463429      0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654
CLOSE_WAIT

tcp   873838      0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486
CLOSE_WAIT

tcp   860245      0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028
CLOSE_WAIT

tcp      112      0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321
CLOSE_WAIT

tcp     2124      0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338
CLOSE_WAIT

tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408
ESTABLISHED

tcp      184      0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862
CLOSE_WAIT

tcp   534489      0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331
ESTABLISHED

tcp      886      0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034
CLOSE_WAIT

tcp        0      0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver
ESTABLISHED

tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348
ESTABLISHED

tcp      187      0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538
CLOSE_WAIT

tcp      253      0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359
CLOSE_WAIT

 

Also I have attached thread dump

 

Thanks,

Pavel

 

 

 

 


Re: Cassandra stucks

Posted by Madalina Matei <ma...@gmail.com>.
Check your JMX port in cassandra-env.sh and see if that's open. 

Also if you have enabled 

 JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public name>"

and you are using an ip address for -Djava.rmi.server.hostname make sure that is the correct ip.


On 11 May 2012, at 16:42, Pavel Polushkin wrote:

> No We are using dedicated phisical hardware. Currently we have 5 nodes.
>  
> From: Madalina Matei [mailto:madalinaimatei@gmail.com] 
> Sent: Friday, May 11, 2012 19:40
> To: user@cassandra.apache.org
> Subject: Re: Cassandra stucks
>  
> Are you using EC2 ?
>  
> On 11 May 2012, at 16:13, Pavel Polushkin wrote:
> 
> 
> We use 1.0.8 version.
>  
> From: David Leimbach [mailto:leimy2k@gmail.com] 
> Sent: Friday, May 11, 2012 18:48
> To: user@cassandra.apache.org
> Subject: Re: Cassandra stucks
>  
> What's the version number of Cassandra?
> 
> On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin <pp...@enkata.com> wrote:
> Hello,
> 
>  
> 
> We faced with a strange problem while testing performance on Cassandra cluster. After some time all nodes went to down state for several days. Now all nodes went back to up state and only one node still down.
> 
>  
> 
> Nodetool on down node throws exception:
> 
> Error connection to remote JMX agent!
> 
> java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out]
> 
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)
> 
>         at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
> 
>         at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)
> 
>         at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)
> 
>         at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)
> 
> Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out]
> 
>         at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
> 
>         at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:185)
> 
>         at javax.naming.InitialContext.lookup(InitialContext.java:392)
> 
>         at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1888)
> 
>         at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1858)
> 
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)
> 
>         ... 4 more
> 
> Caused by: java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out
> 
>         at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)
> 
>         at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
> 
>         at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)
> 
>         at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
> 
>         at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97)
> 
>         ... 9 more
> 
> Caused by: java.net.SocketTimeoutException: Read timed out
> 
>         at java.net.SocketInputStream.socketRead0(Native Method)
> 
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
> 
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> 
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> 
>         at java.io.DataInputStream.readByte(DataInputStream.java:248)
> 
>         at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)
> 
>         ... 13 more
> 
>  
> 
> In system log of down node unlimited list of such errors:
> 
> INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.161 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.165 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.162 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.163 is now dead.
> 
> INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP
> 
>  
> 
> The suspicious fact is that on this node we have several tcp connections to other nodes 7000 port in CLOSE_WAIT state:
> 
> Active Internet connections (servers and established)
> 
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> 
> tcp   869073      0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274 CLOSE_WAIT
> 
> tcp   463429      0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654 CLOSE_WAIT
> 
> tcp   873838      0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486 CLOSE_WAIT
> 
> tcp   860245      0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028 CLOSE_WAIT
> 
> tcp      112      0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321 CLOSE_WAIT
> 
> tcp     2124      0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338 CLOSE_WAIT
> 
> tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408 ESTABLISHED
> 
> tcp      184      0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862 CLOSE_WAIT
> 
> tcp   534489      0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331 ESTABLISHED
> 
> tcp      886      0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034 CLOSE_WAIT
> 
> tcp        0      0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver ESTABLISHED
> 
> tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348 ESTABLISHED
> 
> tcp      187      0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538 CLOSE_WAIT
> 
> tcp      253      0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359 CLOSE_WAIT
> 
>  
> 
> Also I have attached thread dump
> 
>  
> 
> Thanks,
> 
> Pavel
> 
>  
>  
>  


RE: Cassandra stucks

Posted by Pavel Polushkin <pp...@enkata.com>.
No We are using dedicated phisical hardware. Currently we have 5 nodes.

 

From: Madalina Matei [mailto:madalinaimatei@gmail.com] 
Sent: Friday, May 11, 2012 19:40
To: user@cassandra.apache.org
Subject: Re: Cassandra stucks

 

Are you using EC2 ?

 

On 11 May 2012, at 16:13, Pavel Polushkin wrote:





We use 1.0.8 version.

 

From: David Leimbach [mailto:leimy2k@gmail.com] 
Sent: Friday, May 11, 2012 18:48
To: user@cassandra.apache.org
Subject: Re: Cassandra stucks

 

What's the version number of Cassandra?

On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin <pp...@enkata.com>
wrote:

Hello,

 

We faced with a strange problem while testing performance on Cassandra
cluster. After some time all nodes went to down state for several days.
Now all nodes went back to up state and only one node still down.

 

Nodetool on down node throws exception:

Error connection to remote JMX agent!

java.io.IOException: Failed to retrieve RMIServer stub:
javax.naming.CommunicationException [Root exception is
java.rmi.ConnectIOException: error during JRMP connection establishment;
nested exception is:

        java.net.SocketTimeoutException: Read timed out]

        at
javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)

        at
javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.
java:248)

        at
org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)

        at
org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)

        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)

Caused by: javax.naming.CommunicationException [Root exception is
java.rmi.ConnectIOException: error during JRMP connection establishment;
nested exception is:

        java.net.SocketTimeoutException: Read timed out]

        at
com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:10
1)

        at
com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java
:185)

        at javax.naming.InitialContext.lookup(InitialContext.java:392)

        at
javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.
java:1888)

        at
javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java
:1858)

        at
javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)

        ... 4 more

Caused by: java.rmi.ConnectIOException: error during JRMP connection
establishment; nested exception is:

        java.net.SocketTimeoutException: Read timed out

        at
sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)

        at
sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)

        at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)

        at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)

        at
com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97
)

        ... 9 more

Caused by: java.net.SocketTimeoutException: Read timed out

        at java.net.SocketInputStream.socketRead0(Native Method)

        at java.net.SocketInputStream.read(SocketInputStream.java:129)

        at
java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

        at
java.io.BufferedInputStream.read(BufferedInputStream.java:237)

        at java.io.DataInputStream.readByte(DataInputStream.java:248)

        at
sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)

        ... 13 more

 

In system log of down node unlimited list of such errors:

INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804)
InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now
UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804)
InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now
UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.161 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.165 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.162 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.163 is now dead.

INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804)
InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now
UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804)
InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now
UP

 

The suspicious fact is that on this node we have several tcp connections
to other nodes 7000 port in CLOSE_WAIT state:

Active Internet connections (servers and established)

Proto Recv-Q Send-Q Local Address           Foreign Address
State

tcp   869073      0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274
CLOSE_WAIT

tcp   463429      0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654
CLOSE_WAIT

tcp   873838      0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486
CLOSE_WAIT

tcp   860245      0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028
CLOSE_WAIT

tcp      112      0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321
CLOSE_WAIT

tcp     2124      0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338
CLOSE_WAIT

tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408
ESTABLISHED

tcp      184      0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862
CLOSE_WAIT

tcp   534489      0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331
ESTABLISHED

tcp      886      0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034
CLOSE_WAIT

tcp        0      0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver
ESTABLISHED

tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348
ESTABLISHED

tcp      187      0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538
CLOSE_WAIT

tcp      253      0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359
CLOSE_WAIT

 

Also I have attached thread dump

 

Thanks,

Pavel

 

 

 


Re: Cassandra stucks

Posted by Madalina Matei <ma...@gmail.com>.
Are you using EC2 ?

On 11 May 2012, at 16:13, Pavel Polushkin wrote:

> We use 1.0.8 version.
>  
> From: David Leimbach [mailto:leimy2k@gmail.com] 
> Sent: Friday, May 11, 2012 18:48
> To: user@cassandra.apache.org
> Subject: Re: Cassandra stucks
>  
> What's the version number of Cassandra?
> 
> On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin <pp...@enkata.com> wrote:
> Hello,
> 
>  
> 
> We faced with a strange problem while testing performance on Cassandra cluster. After some time all nodes went to down state for several days. Now all nodes went back to up state and only one node still down.
> 
>  
> 
> Nodetool on down node throws exception:
> 
> Error connection to remote JMX agent!
> 
> java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out]
> 
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)
> 
>         at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
> 
>         at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)
> 
>         at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)
> 
>         at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)
> 
> Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out]
> 
>         at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
> 
>         at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:185)
> 
>         at javax.naming.InitialContext.lookup(InitialContext.java:392)
> 
>         at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1888)
> 
>         at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1858)
> 
>         at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)
> 
>         ... 4 more
> 
> Caused by: java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
> 
>         java.net.SocketTimeoutException: Read timed out
> 
>         at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)
> 
>         at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
> 
>         at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)
> 
>         at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
> 
>         at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97)
> 
>         ... 9 more
> 
> Caused by: java.net.SocketTimeoutException: Read timed out
> 
>         at java.net.SocketInputStream.socketRead0(Native Method)
> 
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
> 
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> 
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> 
>         at java.io.DataInputStream.readByte(DataInputStream.java:248)
> 
>         at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)
> 
>         ... 13 more
> 
>  
> 
> In system log of down node unlimited list of such errors:
> 
> INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.161 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.165 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.162 is now dead.
> 
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.163 is now dead.
> 
> INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP
> 
>  
> 
> The suspicious fact is that on this node we have several tcp connections to other nodes 7000 port in CLOSE_WAIT state:
> 
> Active Internet connections (servers and established)
> 
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> 
> tcp   869073      0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274 CLOSE_WAIT
> 
> tcp   463429      0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654 CLOSE_WAIT
> 
> tcp   873838      0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486 CLOSE_WAIT
> 
> tcp   860245      0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028 CLOSE_WAIT
> 
> tcp      112      0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321 CLOSE_WAIT
> 
> tcp     2124      0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338 CLOSE_WAIT
> 
> tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408 ESTABLISHED
> 
> tcp      184      0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862 CLOSE_WAIT
> 
> tcp   534489      0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331 ESTABLISHED
> 
> tcp      886      0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034 CLOSE_WAIT
> 
> tcp        0      0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver ESTABLISHED
> 
> tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348 ESTABLISHED
> 
> tcp      187      0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538 CLOSE_WAIT
> 
> tcp      253      0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359 CLOSE_WAIT
> 
>  
> 
> Also I have attached thread dump
> 
>  
> 
> Thanks,
> 
> Pavel
> 
>  
>  


RE: Cassandra stucks

Posted by Pavel Polushkin <pp...@enkata.com>.
We use 1.0.8 version.

 

From: David Leimbach [mailto:leimy2k@gmail.com] 
Sent: Friday, May 11, 2012 18:48
To: user@cassandra.apache.org
Subject: Re: Cassandra stucks

 

What's the version number of Cassandra?

On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin <pp...@enkata.com>
wrote:

Hello,

 

We faced with a strange problem while testing performance on Cassandra
cluster. After some time all nodes went to down state for several days.
Now all nodes went back to up state and only one node still down.

 

Nodetool on down node throws exception:

Error connection to remote JMX agent!

java.io.IOException: Failed to retrieve RMIServer stub:
javax.naming.CommunicationException [Root exception is
java.rmi.ConnectIOException: error during JRMP connection establishment;
nested exception is:

        java.net.SocketTimeoutException: Read timed out]

        at
javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)

        at
javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.
java:248)

        at
org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)

        at
org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)

        at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)

Caused by: javax.naming.CommunicationException [Root exception is
java.rmi.ConnectIOException: error during JRMP connection establishment;
nested exception is:

        java.net.SocketTimeoutException: Read timed out]

        at
com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:10
1)

        at
com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java
:185)

        at javax.naming.InitialContext.lookup(InitialContext.java:392)

        at
javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.
java:1888)

        at
javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java
:1858)

        at
javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)

        ... 4 more

Caused by: java.rmi.ConnectIOException: error during JRMP connection
establishment; nested exception is:

        java.net.SocketTimeoutException: Read timed out

        at
sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)

        at
sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)

        at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)

        at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)

        at
com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97
)

        ... 9 more

Caused by: java.net.SocketTimeoutException: Read timed out

        at java.net.SocketInputStream.socketRead0(Native Method)

        at java.net.SocketInputStream.read(SocketInputStream.java:129)

        at
java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

        at
java.io.BufferedInputStream.read(BufferedInputStream.java:237)

        at java.io.DataInputStream.readByte(DataInputStream.java:248)

        at
sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)

        ... 13 more

 

In system log of down node unlimited list of such errors:

INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804)
InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now
UP INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804)
InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now
UP INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.161 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.165 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.162 is now dead.

INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
InetAddress /172.15.2.163 is now dead.

INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804)
InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now
UP INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804)
InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now
UP

 

The suspicious fact is that on this node we have several tcp connections
to other nodes 7000 port in CLOSE_WAIT state:

Active Internet connections (servers and established)

Proto Recv-Q Send-Q Local Address           Foreign Address
State

tcp   869073      0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274
CLOSE_WAIT

tcp   463429      0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654
CLOSE_WAIT

tcp   873838      0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486
CLOSE_WAIT

tcp   860245      0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028
CLOSE_WAIT

tcp      112      0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321
CLOSE_WAIT

tcp     2124      0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338
CLOSE_WAIT

tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408
ESTABLISHED

tcp      184      0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862
CLOSE_WAIT

tcp   534489      0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331
ESTABLISHED

tcp      886      0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034
CLOSE_WAIT

tcp        0      0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver
ESTABLISHED

tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348
ESTABLISHED

tcp      187      0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538
CLOSE_WAIT

tcp      253      0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359
CLOSE_WAIT

 

Also I have attached thread dump

 

Thanks,

Pavel

 

 

 


Re: Cassandra stucks

Posted by David Leimbach <le...@gmail.com>.
What's the version number of Cassandra?

On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin <pp...@enkata.com>wrote:

> Hello,****
>
> ** **
>
> We faced with a strange problem while testing performance on Cassandra
> cluster. After some time all nodes went to down state for several days. Now
> all nodes went back to up state and only one node still down.****
>
> ** **
>
> Nodetool on down node throws exception:****
>
> Error connection to remote JMX agent!****
>
> java.io.IOException: Failed to retrieve RMIServer stub:
> javax.naming.CommunicationException [Root exception is
> java.rmi.ConnectIOException: error during JRMP connection establishment;
> nested exception is:****
>
>         java.net.SocketTimeoutException: Read timed out]****
>
>         at
> javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)***
> *
>
>         at
> javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
> ****
>
>         at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)
> ****
>
>         at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)
> ****
>
>         at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)****
>
> Caused by: javax.naming.CommunicationException [Root exception is
> java.rmi.ConnectIOException: error during JRMP connection establishment;
> nested exception is:****
>
>         java.net.SocketTimeoutException: Read timed out]****
>
>         at
> com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
> ****
>
>         at
> com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:185)
> ****
>
>         at javax.naming.InitialContext.lookup(InitialContext.java:392)****
>
>         at
> javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1888)
> ****
>
>         at
> javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1858)
> ****
>
>         at
> javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)***
> *
>
>         ... 4 more****
>
> Caused by: java.rmi.ConnectIOException: error during JRMP connection
> establishment; nested exception is:****
>
>         java.net.SocketTimeoutException: Read timed out****
>
>         at
> sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)****
>
>         at
> sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)****
>
>         at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)****
>
>         at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)****
>
>         at
> com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97)*
> ***
>
>         ... 9 more****
>
> Caused by: java.net.SocketTimeoutException: Read timed out****
>
>         at java.net.SocketInputStream.socketRead0(Native Method)****
>
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)****
>
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)*
> ***
>
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)*
> ***
>
>         at java.io.DataInputStream.readByte(DataInputStream.java:248)****
>
>         at
> sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)****
>
>         ... 13 more****
>
> ** **
>
> In system log of down node unlimited list of such errors:****
>
> INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804)
> InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
> 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP
> INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804)
> InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
> 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
> InetAddress /172.15.2.161 is now dead.****
>
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
> InetAddress /172.15.2.165 is now dead.****
>
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
> InetAddress /172.15.2.162 is now dead.****
>
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818)
> InetAddress /172.15.2.163 is now dead.****
>
> INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804)
> InetAddress /172.15.2.161 is now UP INFO [GossipStage:1] 2012-05-10
> 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP
> INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804)
> InetAddress /172.15.2.163 is now UP INFO [GossipStage:1] 2012-05-10
> 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP*
> ***
>
> ** **
>
> The suspicious fact is that on this node we have several tcp connections
> to other nodes 7000 port in CLOSE_WAIT state:****
>
> Active Internet connections (servers and established)****
>
> Proto Recv-Q Send-Q Local Address           Foreign Address         State*
> ***
>
> tcp   869073      0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274
> CLOSE_WAIT****
>
> tcp   463429      0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654
> CLOSE_WAIT****
>
> tcp   873838      0 rcwocas:afs3-fileserver rcwocas01.enkata.:49486
> CLOSE_WAIT****
>
> tcp   860245      0 rcwocas:afs3-fileserver rcwocas05.enkata.:43028
> CLOSE_WAIT****
>
> tcp      112      0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321
> CLOSE_WAIT****
>
> tcp     2124      0 rcwocas:afs3-fileserver rcwocas03.enkata.:39338
> CLOSE_WAIT****
>
> tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408
> ESTABLISHED****
>
> tcp      184      0 rcwocas:afs3-fileserver rcwocas01.enkata.:48862
> CLOSE_WAIT****
>
> tcp   534489      0 rcwocas:afs3-fileserver rcwocas02.enkata.:35331
> ESTABLISHED****
>
> tcp      886      0 rcwocas:afs3-fileserver rcwocas03.enkata.:56034
> CLOSE_WAIT****
>
> tcp        0      0 rcwocas04.Enkata.:48800 rcwocas:afs3-fileserver
> ESTABLISHED****
>
> tcp        0      0 rcwocas:afs3-fileserver rcwocas01.enkata.:51348
> ESTABLISHED****
>
> tcp      187      0 rcwocas:afs3-fileserver rcwocas05.enkata.:45538
> CLOSE_WAIT****
>
> tcp      253      0 rcwocas:afs3-fileserver rcwocas03.enkata.:51359
> CLOSE_WAIT****
>
> ** **
>
> Also I have attached thread dump****
>
> ** **
>
> Thanks,****
>
> Pavel****
>
> ** **
>
> ** **
>