You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Stefan Podkowinski (JIRA)" <ji...@apache.org> on 2017/01/09 11:26:58 UTC

[jira] [Issue Comment Deleted] (CASSANDRA-13058) dtest failure in hintedhandoff_test.TestHintedHandoff.hintedhandoff_decom_test

     [ https://issues.apache.org/jira/browse/CASSANDRA-13058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Podkowinski updated CASSANDRA-13058:
-------------------------------------------
    Comment: was deleted

(was: Edit: looks like I did something wrong while reverting the mentioned commit. It does not effect the test result. Sorry for that.
I still want to leave the rest of my comment here as it may indicate that this could be indeed related to some buggy gossip state/failure detection handling.

-This looks like a regression caused by CASSANDRA-12192 to me-. -The test seems to work fine without the commit.-

Take a look at this dtest log:

{{./run_dtests.py --vnodes false --nose-options -v  hintedhandoff_test.py:TestHintedHandoff.hintedhandoff_decom_test}}

{noformat}
Node 4:
DEBUG [StorageServiceShutdownHook] 2017-01-04 15:06:16,748 StorageService.java:1437 - DRAINING: starting drain process                           
DEBUG [StorageServiceShutdownHook] 2017-01-04 15:06:20,011 StorageService.java:1437 - DRAINED                                                    

Node 1:
INFO  [GossipStage:1] 2017-01-04 15:06:16,756 Gossiper.java:1036 - InetAddress /127.0.0.4 is now DOWN                                           
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 FailureDetector.java:325 - Forcing conviction of /127.0.0.4                                       
DEBUG [GossipStage:1] 2017-01-04 15:06:16,758 Gossiper.java:370 - /127.0.0.4 marked as alive: false                                             
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:16,758 OutboundTcpConnection.java:374 - Socket to /127.0.0.4 closed        
DEBUG [MessagingService-Outgoing-/127.0.0.4-Small] 2017-01-04 15:06:16,759 OutboundTcpConnection.java:374 - Socket to /127.0.0.4 closed
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:17,883 OutboundTcpConnection.java:388 - Attempting to connect to /127.0.0.4
INFO  [HANDSHAKE-/127.0.0.4] 2017-01-04 15:06:17,884 OutboundTcpConnection.java:510 - Handshaking version with /127.0.0.4
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:17,884 OutboundTcpConnection.java:482 - Done connecting to /127.0.0.4
DEBUG [Native-Transport-Requests-3] 2017-01-04 15:06:20,825 FailureDetector.java:252 - /127.0.0.4 is ALIVE
INFO  [RMI TCP Connection(2)-127.0.0.1] 2017-01-04 15:06:52,930 StorageService.java:1435 - LEAVING: streaming hints to other nodes              
INFO  [HintsDispatcher:2] 2017-01-04 15:06:52,955 HintsDispatchExecutor.java:152 - Transferring all hints to 054006b5-412c-4d4e-a28a-7610574de79     d                                                                                                                                               
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:53,035 OutboundTcpConnection.java:495 - Unable to connect to /127.0.0.4
java.net.ConnectException: Connection refused
▸  at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_112]
▸  at sun.nio.ch.Net.connect(Net.java:454) ~[na:1.8.0_112]
▸  at sun.nio.ch.Net.connect(Net.java:446) ~[na:1.8.0_112]
▸  at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648) ~[na:1.8.0_112]
▸  at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146) ~[main/:na]
▸  at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132) ~[main/:na]
▸  at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:397) [main/:na]
▸  at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:234) [main/:na]
{noformat}


I've added some additional log messages, so don't get confused when you can't find some of them in the code. What's happening here is that the gossiper will down node1 and will trigger {{OutboundTcpConnectionPool.reset()}} via {{MessagingService.convict()}}. This will cause a reconnect shortly, as the message should now always be retried from the backlog again. 

Unfortunately I can't see which message is affected here. But my assumption is that node4 keeps accepting messages even for a short time after gossip change propagation and will return a response to node1, which will flag node4 alive again in the process. But I'm not exactly sure, as the node status related code is not easy to understand with all the side effects and unclear state handling. :(

/cc [~thobbs]
)

> dtest failure in hintedhandoff_test.TestHintedHandoff.hintedhandoff_decom_test
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13058
>             Project: Cassandra
>          Issue Type: Test
>          Components: Testing
>            Reporter: Sean McCarthy
>            Priority: Blocker
>              Labels: dtest, test-failure
>             Fix For: 3.10
>
>         Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, node2_debug.log, node2_gc.log, node3.log, node3_debug.log, node3_gc.log, node4.log, node4_debug.log, node4_gc.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.X_novnode_dtest/16/testReport/hintedhandoff_test/TestHintedHandoff/hintedhandoff_decom_test/
> {code}
> Error Message
> Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', ['decommission']] exited with non-zero status; exit status: 2; 
> stderr: error: Error while decommissioning node: Failed to transfer all hints to 59f20b4f-0215-4e18-be1b-7e00f2901629
> {code}{code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/hintedhandoff_test.py", line 167, in hintedhandoff_decom_test
>     node1.decommission()
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 1314, in decommission
>     self.nodetool("decommission")
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 783, in nodetool
>     return handle_external_tool_process(p, ['nodetool', '-h', 'localhost', '-p', str(self.jmx_port), cmd.split()])
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 1993, in handle_external_tool_process
>     raise ToolError(cmd_args, rc, out, err)
> {code}{code}
> java.lang.RuntimeException: Error while decommissioning node: Failed to transfer all hints to 59f20b4f-0215-4e18-be1b-7e00f2901629
> 	at org.apache.cassandra.service.StorageService.decommission(StorageService.java:3924)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
> 	at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
> 	at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
> 	at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
> 	at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
> 	at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
> 	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
> 	at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
> 	at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1466)
> 	at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
> 	at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1307)
> 	at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1399)
> 	at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:828)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323)
> 	at sun.rmi.transport.Transport$1.run(Transport.java:200)
> 	at sun.rmi.transport.Transport$1.run(Transport.java:197)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
> 	at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$241(TCPTransport.java:683)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$284/1694175644.run(Unknown Source)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)