Posted to issues@ratis.apache.org by "Elek, Marton (JIRA)" <ji...@apache.org> on 2017/12/03 14:29:00 UTC

[jira] [Created] (RATIS-163) TestRaftWithHadoopRpc fails because of Hadoop RPC retry logic

Elek, Marton created RATIS-163:
----------------------------------

             Summary: TestRaftWithHadoopRpc fails because of Hadoop RPC retry logic
                 Key: RATIS-163
                 URL: https://issues.apache.org/jira/browse/RATIS-163
             Project: Ratis
          Issue Type: Bug
            Reporter: Elek, Marton
            Assignee: Elek, Marton


TestRaftWithHadoopRpc failed during the last qbt nightly build.

The problem can be reproduced locally:

mvn test -Dtest=TestRaftWithHadoopRpc#testBasicLeaderElection

The key output is at the end of the log file:

{code}
2017-12-03 15:25:00,966 INFO  ipc.Client (Client.java:handleConnectionFailure(940)) - Retrying connect to server: 0.0.0.0/0.0.0.0:46409. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2017-12-03 15:25:00,967 WARN  ipc.Client (Client.java:handleConnectionFailure(922)) - Failed to connect to server: 0.0.0.0/0.0.0.0:46409: retries get failed due to exceeded maximum allowed retries number: 10
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:679)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:775)
	at org.apache.hadoop.ipc.Client$Connection.access$3300(Client.java:410)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1556)
	at org.apache.hadoop.ipc.Client.call(Client.java:1387)
	at org.apache.hadoop.ipc.Client.call(Client.java:1351)
	at org.apache.hadoop.ipc.ProtobufRpcEngineShaded$Invoker.invoke(ProtobufRpcEngineShaded.java:214)
	at com.sun.proxy.$Proxy13.requestVote(Unknown Source)
	at org.apache.ratis.hadooprpc.server.HadoopRpcService.lambda$requestVote$4(HadoopRpcService.java:176)
	at org.apache.ratis.hadooprpc.server.HadoopRpcService.processRequest(HadoopRpcService.java:188)
	at org.apache.ratis.hadooprpc.server.HadoopRpcService.requestVote(HadoopRpcService.java:175)
	at org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:189)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}

This test case kills the leaders one by one. After a leader is killed, the remaining nodes still try to connect to it: on every vote request, the running nodes will (try to) send a message to the killed nodes.

However, Hadoop RPC retries failed connections by default, so the LeaderElection.submitRequests/requestVote task (which is executed in a separate executor) won't finish even after the LeaderElection is stopped. The requestVote task should normally finish quite fast, but in this case Hadoop RPC keeps trying to reconnect (up to 10 retries with a 1000 ms sleep, according to the log above), so the internal executor of the LeaderElection keeps working even though the LeaderElection itself is stopped.
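
The effect can be illustrated with a minimal, self-contained Java sketch. It uses neither Ratis nor Hadoop code: `connectWithRetries` is a hypothetical stand-in for an RPC call that retries a refused connection with a fixed sleep, and the sketch assumes that stopping only calls `shutdown()` on the executor. The sleep is scaled down from the 1000 ms seen in the log so the demo runs quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class RetryOutlivesShutdown {
    // Hypothetical stand-in for an RPC call whose connection is refused every time:
    // it sleeps a fixed interval between attempts, like a fixed-sleep retry policy.
    static void connectWithRetries(int maxRetries, long sleepMillis) throws InterruptedException {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            TimeUnit.MILLISECONDS.sleep(sleepMillis);
        }
    }

    // Returns true if the executor is still busy shortly after shutdown().
    static boolean stillBusyAfterShutdown() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        executor.submit(() -> {
            started.countDown();
            connectWithRetries(10, 100); // 10 retries, as in the log; 100 ms instead of 1000 ms
            return null;
        });
        try {
            started.await();
            // shutdown() rejects new tasks but does not interrupt the running one.
            executor.shutdown();
            return !executor.awaitTermination(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Clean up so the demo itself exits promptly.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("still busy after shutdown(): " + stillBusyAfterShutdown());
    }
}
```

In this sketch the retrying task needs about one second in total, so the executor has not terminated 200 ms after `shutdown()`.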

The easiest way to solve this is to disable the Hadoop IPC retry. I suggest doing this (at least for now), because the current test failure is not a real test case failure: the JUnit test framework simply can't finish the test method while there are still ongoing Hadoop RPC client calls.
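
As a rough sketch of what disabling the retry could look like: to my knowledge, the connect retry count shown in the log (maxRetries=10) is controlled by the Hadoop property `ipc.client.connect.max.retries`, but this should be verified against the Hadoop version in use. The class name `NoRetryIpcSettings` and the helper `overrides()` are hypothetical, and the overrides are collected in a plain map so that the example runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NoRetryIpcSettings {
    // Assumed to be the Hadoop key behind "maxRetries=10" in the log; verify for your version.
    static final String MAX_RETRIES_KEY = "ipc.client.connect.max.retries";

    // Hypothetical helper: the settings a test could apply before creating RPC clients.
    static Map<String, String> overrides() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(MAX_RETRIES_KEY, "0"); // give up after the first refused connection
        return settings;
    }

    public static void main(String[] args) {
        // With Hadoop available, each entry could be applied via
        // org.apache.hadoop.conf.Configuration#set(key, value).
        overrides().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```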

A trickier solution would be to stop the in-flight Hadoop client requests when the LeaderElection shuts down.
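
One possible shape of that approach, as a plain-Java sketch under stated assumptions: the in-flight request is modelled by a hypothetical retry loop that sleeps between attempts, and the shutdown path calls `shutdownNow()` to interrupt it. Whether a real Hadoop RPC client reacts to interruption this promptly is not established here, which is part of what makes this solution tricky.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class CancelInFlightRequests {
    // Returns true if the executor terminates soon after shutdownNow().
    static boolean terminatesAfterShutdownNow() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        executor.submit(() -> {
            started.countDown();
            // Hypothetical stand-in for a retrying RPC: 10 attempts, 1000 ms apart.
            for (int attempt = 0; attempt < 10; attempt++) {
                TimeUnit.MILLISECONDS.sleep(1000);
            }
            return null;
        });
        try {
            started.await();
            // shutdownNow() interrupts the worker thread, so the sleep throws
            // InterruptedException and the task ends instead of retrying for ~10 s.
            executor.shutdownNow();
            return executor.awaitTermination(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("terminated after shutdownNow(): " + terminatesAfterShutdownNow());
    }
}
```

The sketch only works because `sleep` is interruptible; a blocking call that ignores interruption would need its own cancellation hook.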


