Posted to issues@ozone.apache.org by "ArafatKhan2198 (via GitHub)" <gi...@apache.org> on 2023/11/29 14:31:31 UTC

[PR] HDDS-9766 Intermittent AlreadyClosedException in TestCommitWatcher.testReleaseBuffersOnException [ozone]

ArafatKhan2198 opened a new pull request, #5700:
URL: https://github.com/apache/ozone/pull/5700

   ## What changes were proposed in this pull request?
   The test intermittently fails with the following chain of exceptions: `ExecutionException`, `AlreadyClosedException`, `RaftRetryFailureException`, `NotLeaderException`.
   
   ```
   org.apache.hadoop.hdds.scm.storage.TestCommitWatcher.testReleaseBuffersOnException -- Time elapsed: 28.88 s <<< ERROR!
   java.util.concurrent.ExecutionException: org.apache.ratis.protocol.exceptions.AlreadyClosedException: SlidingWindow$Client:client-2D6A59F17A72->RAFT is closed.
   	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
   	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
   	at org.apache.hadoop.hdds.scm.storage.TestCommitWatcher.testReleaseBuffersOnException(TestCommitWatcher.java:301)
   ...
   Caused by: org.apache.ratis.protocol.exceptions.AlreadyClosedException: SlidingWindow$Client:client-2D6A59F17A72->RAFT is closed.
   	at org.apache.ratis.util.SlidingWindow$Client.alreadyClosed(SlidingWindow.java:406)
   ...
   Caused by: org.apache.ratis.protocol.exceptions.RaftRetryFailureException: Failed RaftClientRequest:client-2D6A59F17A72->31ca1c78-6c1a-481a-9835-ed9e1bd72d7b@group-4121B0A26A9F, cid=40, seq=1*, Watch(0), null for 3 attempts with RequestTypeDependentRetryPolicy{WRITE->ExceptionDependentRetry(maxAttempts=2147483647; defaultPolicy=MultipleLinearRandomRetry[5x5s, 5x10s, 5x15s, 5x20s, 5x25s, 10x60s]; map={org.apache.ratis.protocol.exceptions.GroupMismatchException->NoRetry, org.apache.ratis.protocol.exceptions.NotReplicatedException->NoRetry, org.apache.ratis.protocol.exceptions.ResourceUnavailableException->org.apache.ratis.retry.ExponentialBackoffRetry@571a7a22, org.apache.ratis.protocol.exceptions.StateMachineException->NoRetry, org.apache.ratis.protocol.exceptions.TimeoutIOException->org.apache.ratis.retry.ExponentialBackoffRetry@571a7a22}), WATCH->ExceptionDependentRetry(maxAttempts=2147483647; defaultPolicy=MultipleLinearRandomRetry[5x5s, 5x10s, 5x15s, 5x20s, 5x25s, 10x60s]; map={org.apache.ratis.protocol.exceptions.GroupMismatchException->NoRetry, org.apache.ratis.protocol.exceptions.NotReplicatedException->NoRetry, org.apache.ratis.protocol.exceptions.ResourceUnavailableException->org.apache.ratis.retry.ExponentialBackoffRetry@571a7a22, org.apache.ratis.protocol.exceptions.StateMachineException->NoRetry, org.apache.ratis.protocol.exceptions.TimeoutIOException->NoRetry})}
   	at org.apache.ratis.client.impl.RaftClientImpl.noMoreRetries(RaftClientImpl.java:353)
   	... 20 more
   Caused by: org.apache.ratis.protocol.exceptions.NotLeaderException: Server 31ca1c78-6c1a-481a-9835-ed9e1bd72d7b@group-4121B0A26A9F is not the leader 66f0d3a6-9de9-4155-af63-8575b18d2e1d|10.1.0.11:15121
   	at org.apache.ratis.client.impl.ClientProtoUtils.toRaftClientReply(ClientProtoUtils.java:397)
   	at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:310)
   	... 11 more
   ```
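   
   The nesting in the trace above follows from how `CompletableFuture.get` reports failures: it wraps the original failure in an `ExecutionException`, and each `Caused by:` line is one more `getCause()` link. A minimal, self-contained sketch of walking such a chain is shown below. The exception classes in it are local stand-ins that only borrow the names from the trace (they are not the real Ratis types), and `CauseChainDemo`, `rootCause`, and `failedWatch` are hypothetical names introduced for illustration.
   
   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ExecutionException;

   // Illustrative sketch only: the nested exception classes are local stand-ins,
   // not the real org.apache.ratis.protocol.exceptions types.
   class CauseChainDemo {
     static class NotLeaderException extends Exception {
       NotLeaderException(String msg) { super(msg); }
     }

     static class RaftRetryFailureException extends Exception {
       RaftRetryFailureException(String msg, Throwable cause) { super(msg, cause); }
     }

     static class AlreadyClosedException extends Exception {
       AlreadyClosedException(String msg, Throwable cause) { super(msg, cause); }
     }

     /** Follows getCause() links until the innermost throwable is reached. */
     static Throwable rootCause(Throwable t) {
       Throwable current = t;
       while (current.getCause() != null) {
         current = current.getCause();
       }
       return current;
     }

     /** Builds a failed future whose cause chain mirrors the order in the trace. */
     static CompletableFuture<Void> failedWatch() {
       Throwable notLeader = new NotLeaderException("server is not the leader");
       Throwable retry = new RaftRetryFailureException("failed for 3 attempts", notLeader);
       Throwable closed = new AlreadyClosedException("client is closed", retry);
       CompletableFuture<Void> future = new CompletableFuture<>();
       future.completeExceptionally(closed);
       return future;
     }

     public static void main(String[] args) throws InterruptedException {
       try {
         failedWatch().get();
       } catch (ExecutionException e) {
         // get() wraps the failure, so e.getCause() is the stand-in AlreadyClosedException.
         System.out.println("direct cause: " + e.getCause().getClass().getSimpleName());
         System.out.println("root cause:   " + rootCause(e).getClass().getSimpleName());
       }
     }
   }
   ```
   
   Run as written, the sketch reports the stand-in `AlreadyClosedException` as the direct cause and the stand-in `NotLeaderException` as the root cause, matching the order of the `Caused by:` lines in the trace.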
   
   Proposed fix:
   - Reduce the number of datanodes and pipelines in the test's MiniOzoneCluster configuration.
   - With fewer datanodes and pipelines, resource contention and timing issues are less likely. Such issues often cause intermittent failures that are hard to reproduce and diagnose.
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-9766
   ## How was this patch tested?
   
   Ran the test 300 times in my fork; all runs passed:
   
   Test Run 1: https://github.com/ArafatKhan2198/ozone/actions/runs/7031682772
   Test Run 2: https://github.com/ArafatKhan2198/ozone/actions/runs/7031685390
   Test Run 3: https://github.com/ArafatKhan2198/ozone/actions/runs/7031688216


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9766 Intermittent AlreadyClosedException in TestCommitWatcher.testReleaseBuffersOnException [ozone]

Posted by "ArafatKhan2198 (via GitHub)" <gi...@apache.org>.
ArafatKhan2198 commented on PR #5700:
URL: https://github.com/apache/ozone/pull/5700#issuecomment-1832120339

   @adoroszlai 




Re: [PR] HDDS-9766. Intermittent AlreadyClosedException in TestCommitWatcher.testReleaseBuffersOnException [ozone]

Posted by "nandakumar131 (via GitHub)" <gi...@apache.org>.
nandakumar131 commented on PR #5700:
URL: https://github.com/apache/ozone/pull/5700#issuecomment-1832647735

   +1, LGTM.




Re: [PR] HDDS-9766. Intermittent AlreadyClosedException in TestCommitWatcher.testReleaseBuffersOnException [ozone]

Posted by "nandakumar131 (via GitHub)" <gi...@apache.org>.
nandakumar131 merged PR #5700:
URL: https://github.com/apache/ozone/pull/5700

