Posted to issues@ignite.apache.org by "Mirza Aliev (Jira)" <ji...@apache.org> on 2024/03/15 15:26:00 UTC
[jira] [Commented] (IGNITE-21381) ActiveActorTest#testChangeLeaderForce has problems with resource cleanup

    [ https://issues.apache.org/jira/browse/IGNITE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827534#comment-17827534 ] 

Mirza Aliev commented on IGNITE-21381:
--------------------------------------

[~v.pyatkov] LGTM, thanks for the contribution! 

> ActiveActorTest#testChangeLeaderForce has problems with resource cleanup
> ------------------------------------------------------------------------
>
>                 Key: IGNITE-21381
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21381
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Assignee: Vladislav Pyatkov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: screenshot-1.png, screenshot-2.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> {{ActiveActorTest#testChangeLeaderForce}} has started to be flaky on TC with:
> {noformat}
> [05:19:12]F:			 [org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(TestInfo)] org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
> 	at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
> 	at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
> 	at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> 	at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> 	at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
> 	at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:180)
> 	at app//org.apache.ignite.internal.placementdriver.ActiveActorTest.testChangeLeaderForce(ActiveActorTest.java:370)
> {noformat}
> From the log we can see that the leadership transfer, which was supposed to succeed, does not happen. The behaviour is as follows:
> 1) Current leader is {{Leader: ClusterNodeImpl [id=e99210fb-f872-4e08-a99c-53f9512da20e, name=aat_tclf_1235}}
> 2) We want to transfer leadership to {{Peer to transfer leader: Peer [consistentId=aat_tclf_1234, idx=0]}}
> 3) The transfer process is started
> 4) We receive a warning about an error during {{GetLeaderRequestImpl}}:
> {noformat}
> [2024-01-29T05:19:08,855][WARN ][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error during the request occurred (will be retried on the randomly selected node) [request=GetLeaderRequestImpl [groupId=TestReplicationGroup, peerId=aat_tclf_1235], peer=Peer [consistentId=aat_tclf_1235, idx=0], newPeer=Peer [consistentId=aat_tclf_1234, idx=0]].
> java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
> 	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) ~[?:?]
> 	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376) ~[?:?]
> 	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019) ~[?:?]
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) [?:?]
> 	at java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792) [?:?]
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
> 	at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.util.concurrent.TimeoutException
> 	... 7 more
> {noformat}
> 5) After that, node {{aat_tclf_1234}} logs that it received an invalid {{RequestVoteResponse}} from {{aat_tclf_1236}}, with the state reported as {{STATE_LEADER}} instead of {{STATE_CANDIDATE}}:
> {noformat}
> [2024-01-29T05:19:11,370][WARN ][%aat_tclf_1234%JRaft-Response-Processor-15][NodeImpl] Node <TestReplicationGroup/aat_tclf_1234> received invalid RequestVoteResponse from aat_tclf_1236, state not in STATE_CANDIDATE but STATE_LEADER.
> {noformat}
>  
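
The {{CompletionException}} wrapping a {{TimeoutException}} in step 4 is the usual shape of a timed-out {{CompletableFuture}} observed through a dependent stage. The following is a minimal sketch using only the JDK; it does not reproduce {{RaftGroupServiceImpl}}, and the class name {{TimeoutDemo}} and method {{timeoutCause}} are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    /**
     * Creates a future that is never completed normally, lets it time out,
     * and returns the cause seen by a caller of a dependent stage.
     */
    static Throwable timeoutCause(long timeoutMillis) {
        // Stands in for a request whose response never arrives (assumption).
        CompletableFuture<String> response = new CompletableFuture<String>()
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);

        // A dependent stage: it completes exceptionally once the source times out.
        CompletableFuture<String> dependent = response.thenApply(String::trim);

        try {
            dependent.join();
            return null; // not reached in this sketch: nothing completes the future normally
        } catch (CompletionException e) {
            // The original TimeoutException is available as the cause.
            return e.getCause();
        }
    }

    public static void main(String[] args) {
        Throwable cause = timeoutCause(100);
        System.out.println("cause is TimeoutException: " + (cause instanceof TimeoutException));
    }
}
```

Code that decides whether an error is recoverable therefore has to inspect the cause rather than the outer exception.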
> Tests {{ActiveActorTest#testChangeLeaderForce}} and 
> {{TopologyAwareRaftGroupServiceTest#testChangeLeaderForce}} were muted.
> There are also other problems with these tests: they do not clean up resources correctly in case of failure. The cluster is stopped in the test body itself, so if an assertion fails, the rest of the test is not executed and the cluster is never stopped.
> The next problem is that if we run this test several times, even when every run passes, at some point a new test cannot be started because of 
> {noformat}
>  java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
> {noformat}
> VisualVM shows that {{Raft-Group-Client}} threads have leaked:
>  !screenshot-1.png! 
>  !screenshot-2.png! 
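
A leak like this can also be checked programmatically by counting live threads with a given name prefix before and after a test. This is a minimal JDK-only sketch; the class {{ThreadLeakCheck}} and the prefix used in {{main}} are hypothetical, and the real thread naming in Ignite may differ.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLeakCheck {
    /** Counts live threads in this JVM whose name starts with the given prefix. */
    static long countThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical prefix, used only for this demonstration.
        String prefix = "Demo-Raft-Group-Client-";
        AtomicInteger seq = new AtomicInteger();

        long before = countThreads(prefix);

        // A pool that names its threads with the prefix, as a stand-in for a client executor.
        ExecutorService pool = Executors.newFixedThreadPool(
                2, r -> new Thread(r, prefix + seq.getAndIncrement()));
        pool.submit(() -> { }).get();

        System.out.println("threads while the pool is running: " + (countThreads(prefix) - before));

        // Without this shutdown the pool threads would stay alive, which is the leak pattern.
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);

        // Worker threads may need a moment to exit after termination is reported.
        System.out.println("threads shortly after shutdown: " + (countThreads(prefix) - before));
    }
}
```

Comparing the count taken before a test with the count taken after cleanup gives a simple check that does not depend on a profiler.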
> h4. Definition of done
> 1) Investigate and fix the problem with the failed transferLeadership
> 2) Correctly clean up resources if a test fails. Move all cleanup logic to the {{AfterEach}} section for all tests in {{ActiveActorTest}} and 
> {{TopologyAwareRaftGroupServiceTest}}
> 3) Refactor {{ActiveActorTest}} and {{TopologyAwareRaftGroupServiceTest}}; the code is copy-pasted between them
> 4) Investigate the problem with leaked {{Raft-Group-Client}} threads 
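
One possible shape for item 2, as a sketch only: the test registers every resource it starts, and a single cleanup call releases all of them in reverse order even if some fail to close. The class {{TestResources}} and its methods are hypothetical and are not taken from the Ignite code base; in a JUnit 5 test, {{closeAll()}} would be called from a method annotated with {{@AfterEach}}, so that it runs whether or not an assertion failed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class TestResources {
    // Most recently registered resource is closed first.
    private final Deque<AutoCloseable> resources = new ArrayDeque<>();

    /** Remembers a resource so that closeAll() releases it later. */
    public void register(AutoCloseable resource) {
        resources.push(resource);
    }

    /**
     * Closes every registered resource in reverse registration order.
     * A failure in one close() does not prevent the remaining ones from running;
     * all failures are reported together at the end.
     */
    public void closeAll() {
        IllegalStateException failure = null;

        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (Exception e) {
                if (failure == null) {
                    failure = new IllegalStateException("Failed to close test resources");
                }
                failure.addSuppressed(e);
            }
        }

        if (failure != null) {
            throw failure;
        }
    }

    public static void main(String[] args) {
        TestResources resources = new TestResources();

        // Stand-ins for a cluster node and a client (assumptions for the demo).
        resources.register(() -> System.out.println("node stopped"));
        resources.register(() -> System.out.println("client stopped"));

        // In a JUnit 5 test this call would live in an @AfterEach method.
        resources.closeAll();
    }
}
```

Keeping the stop logic out of the test body means a failed assertion can no longer skip it.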



--
This message was sent by Atlassian Jira
(v8.20.10#820010)