Posted to issues@ratis.apache.org by "Elek, Marton (JIRA)" <ji...@apache.org> on 2017/12/03 14:29:00 UTC
[jira] [Created] (RATIS-163) TestRaftWithHadoopRpc fails because of hadoop rpc retry logic
Elek, Marton created RATIS-163:
----------------------------------
Summary: TestRaftWithHadoopRpc fails because of hadoop rpc retry logic
Key: RATIS-163
URL: https://issues.apache.org/jira/browse/RATIS-163
Project: Ratis
Issue Type: Bug
Reporter: Elek, Marton
Assignee: Elek, Marton
TestRaftWithHadoopRpc failed during the last qbt nightly build.
The problem can be reproduced locally:
mvn test -Dtest=TestRaftWithHadoopRpc#testBasicLeaderElection
The key output is at the end of the log file:
{code}
2017-12-03 15:25:00,966 INFO ipc.Client (Client.java:handleConnectionFailure(940)) - Retrying connect to server: 0.0.0.0/0.0.0.0:46409. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2017-12-03 15:25:00,967 WARN ipc.Client (Client.java:handleConnectionFailure(922)) - Failed to connect to server: 0.0.0.0/0.0.0.0:46409: retries get failed due to exceeded maximum allowed retries number: 10
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:679)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:775)
at org.apache.hadoop.ipc.Client$Connection.access$3300(Client.java:410)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1556)
at org.apache.hadoop.ipc.Client.call(Client.java:1387)
at org.apache.hadoop.ipc.Client.call(Client.java:1351)
at org.apache.hadoop.ipc.ProtobufRpcEngineShaded$Invoker.invoke(ProtobufRpcEngineShaded.java:214)
at com.sun.proxy.$Proxy13.requestVote(Unknown Source)
at org.apache.ratis.hadooprpc.server.HadoopRpcService.lambda$requestVote$4(HadoopRpcService.java:176)
at org.apache.ratis.hadooprpc.server.HadoopRpcService.processRequest(HadoopRpcService.java:188)
at org.apache.ratis.hadooprpc.server.HadoopRpcService.requestVote(HadoopRpcService.java:175)
at org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:189)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
In this test case the unit test kills the leaders one by one. After a leader is killed, the remaining servers still try to connect to it: on every vote request the running nodes will (try to) send a message to the killed nodes.
But Hadoop RPC has retry logic enabled by default, so the LeaderElection.submitRequests/requestVote call (which is executed in a separate executor) won't finish even if the LeaderElection is stopped. The requestVote task should normally finish quite fast, but in this case Hadoop RPC tries to reconnect again and again, so the internal executor of the LeaderElection keeps working even after the LeaderElection itself is stopped.
The easiest way to solve this is to disable the Hadoop IPC retry. I suggest this (at least for now), as the current failure is not a real test case failure: the JUnit test framework simply can't finish the test method while there are still ongoing Hadoop RPC client calls.
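To illustrate the effect, here is a minimal stand-alone simulation. It does not use Ratis or Hadoop at all; the class and method names are made up for this sketch. A task that retries with a fixed sleep keeps a single-thread executor busy after shutdown(), while the same task with zero retries lets the executor terminate at once. The numbers are a scaled-down stand-in for maxRetries=10, sleepTime=1000 ms from the log.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Stand-alone simulation; names are hypothetical, not Ratis or Hadoop APIs. */
public class RetryKeepsExecutorBusy {

  /** Simulated connect loop: every attempt fails, with a fixed sleep before each retry. */
  public static void connectWithRetries(int maxRetries, long sleepMillis) {
    for (int retry = 0; retry < maxRetries; retry++) {
      try {
        Thread.sleep(sleepMillis);          // fixed sleep before the next attempt
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // only triggered by the cleanup below
        return;
      }
    }
  }

  /** Submits one simulated request, calls shutdown(), and reports whether the executor finished in time. */
  public static boolean terminatesAfterShutdown(int maxRetries, long sleepMillis, long waitMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.submit(() -> connectWithRetries(maxRetries, sleepMillis));
    executor.shutdown();                    // rejects new tasks; the running task continues
    try {
      return executor.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      executor.shutdownNow();               // cleanup so the demo itself can exit
    }
  }

  public static void main(String[] args) {
    // 10 retries x 200 ms is far longer than the 500 ms we are willing to wait.
    System.out.println("with retries:    " + terminatesAfterShutdown(10, 200, 500));
    // Retries disabled: the single failed attempt returns immediately.
    System.out.println("without retries: " + terminatesAfterShutdown(0, 200, 500));
  }
}
```

With retries enabled the executor should still be busy when the wait expires; with retries disabled it should terminate within the wait. How the retry count is actually lowered in the Hadoop RPC client configuration is not shown here.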
The trickier solution would be to stop the in-flight Hadoop client requests when the LeaderElection is shut down.
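A rough sketch of this second option, again as a stand-alone simulation with made-up names rather than actual Ratis code. It assumes the in-flight request reacts to thread interruption; whether the Hadoop IPC client does so in its retry loop is not verified here. The executor is stopped with shutdownNow(), which interrupts the running task instead of waiting for it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Stand-alone simulation; names are hypothetical, not Ratis or Hadoop APIs. */
public class CancelOnShutdown {

  /** Simulated retry loop that gives up as soon as its thread is interrupted. */
  public static void retryUntilInterrupted(int maxRetries, long sleepMillis, CountDownLatch started) {
    started.countDown();                    // signal that the request is in flight
    for (int retry = 0; retry < maxRetries; retry++) {
      try {
        Thread.sleep(sleepMillis);          // fixed sleep before the next attempt
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the flag and stop retrying
        return;
      }
    }
  }

  /** Starts one simulated request, interrupts it via shutdownNow(), and reports whether the executor finished in time. */
  public static boolean stopsPromptly(int maxRetries, long sleepMillis, long waitMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    executor.submit(() -> retryUntilInterrupted(maxRetries, sleepMillis, started));
    try {
      started.await();                      // wait until the request is really running
      executor.shutdownNow();               // interrupts the worker thread
      return executor.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      executor.shutdownNow();
      return false;
    }
  }

  public static void main(String[] args) {
    // Uninterrupted, 10 retries x 1000 ms would block for about 10 seconds.
    System.out.println("stopped promptly: " + stopsPromptly(10, 1000, 500));
  }
}
```

This only helps if the blocked call honors the interrupt, which is why it is the trickier option compared to disabling the retry.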
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)