You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "Raju Balpande (Jira)" <ji...@apache.org> on 2023/10/16 09:29:00 UTC

[jira] [Comment Edited] (HDDS-6115) Intermittent failure in TestContainerStateMachineFailures#testApplyTransactionIdempotencyWithClosedContainer

    [ https://issues.apache.org/jira/browse/HDDS-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775623#comment-17775623 ] 

Raju Balpande edited comment on HDDS-6115 at 10/16/23 9:28 AM:
---------------------------------------------------------------

After I added the logs I observed a pattern in failure cases that a method RaftLogIndex.updateIncreasingly() which is part of ratis library is not been called or can say missed or can say been delayed to call.
I tried to debug the flow and observed it been called by a separate thread which is triggered but doesn't have surety to complete before our code gets finished.

The call log I see is as follows:
{code:java}
updateIncreasingly:68, RaftLogIndex (org.apache.ratis.server.raftlog)
updateFlushedIndexIncreasingly:429, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
updateFlushedIndexIncreasingly:424, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
flushIfNecessary:391, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
access$1200:61, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
execute:566, SegmentedRaftLogWorker$WriteLog (org.apache.ratis.server.raftlog.segmented)
run:316, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
run:-1, 0x0000000000000000 (org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$$Lambda$780)
call:511, Executors$RunnableAdapter (java.util.concurrent)
run$$$capture:266, FutureTask (java.util.concurrent)
run:-1, FutureTask (java.util.concurrent)
 - Async stack trace
<init>:151, FutureTask (java.util.concurrent)
newTaskFor:87, AbstractExecutorService (java.util.concurrent)
submit:111, AbstractExecutorService (java.util.concurrent)
submit:678, Executors$DelegatedExecutorService (java.util.concurrent)
start:242, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
openImpl:216, SegmentedRaftLog (org.apache.ratis.server.raftlog.segmented)
open:251, RaftLogBase (org.apache.ratis.server.raftlog)
initRaftLog:239, ServerState (org.apache.ratis.server.impl)
initRaftLog:220, ServerState (org.apache.ratis.server.impl)
lambda$new$5:161, ServerState (org.apache.ratis.server.impl)
get:-1, ServerState$$Lambda$678/0x0000000000000000 (org.apache.ratis.server.impl)
get:62, MemoizedSupplier (org.apache.ratis.util)
initialize:177, ServerState (org.apache.ratis.server.impl)
start:338, RaftServerImpl (org.apache.ratis.server.impl)
accept:-1, RaftServerProxy$$Lambda$723/0x0000000000000000 (org.apache.ratis.server.impl)
accept:188, ConcurrentUtils (org.apache.ratis.util)
lambda$null$4:180, ConcurrentUtils (org.apache.ratis.util)
run:-1, ConcurrentUtils$$Lambda$645/0x0000000000000000 (org.apache.ratis.util)
runWorker:1149, ThreadPoolExecutor (java.util.concurrent)
run:624, ThreadPoolExecutor$Worker (java.util.concurrent)
run:823, Thread (java.lang) {code}
To prove my assumption I introduced some wait in main thread and as I see it worked fine for 10splits*4iterations in [https://github.com/raju-balpande/apache_ozone/actions/runs/6511303376]


was (Author: JIRAUSER296391):
After I added the logs I observed a pattern in failure cases that a method RaftLogIndex.updateIncreasingly() which is part of ratis library is not been called or can say missed or can say been delayed to call.
I tried to debug the flow and observed it been called by a separate thread which is triggered but doesn't have surety to complete before our code gets finished.

The call log I see is as follows:
{code:java}
call:511, Executors$RunnableAdapter (java.util.concurrent)
run$$$capture:266, FutureTask (java.util.concurrent)
run:-1, FutureTask (java.util.concurrent)
 - Async stack trace
<init>:151, FutureTask (java.util.concurrent)
newTaskFor:87, AbstractExecutorService (java.util.concurrent)
submit:111, AbstractExecutorService (java.util.concurrent)
submit:678, Executors$DelegatedExecutorService (java.util.concurrent)
start:242, SegmentedRaftLogWorker (org.apache.ratis.server.raftlog.segmented)
openImpl:216, SegmentedRaftLog (org.apache.ratis.server.raftlog.segmented)
open:251, RaftLogBase (org.apache.ratis.server.raftlog)
initRaftLog:239, ServerState (org.apache.ratis.server.impl)
initRaftLog:220, ServerState (org.apache.ratis.server.impl)
lambda$new$5:161, ServerState (org.apache.ratis.server.impl)
get:-1, ServerState$$Lambda$678/0x0000000000000000 (org.apache.ratis.server.impl)
get:62, MemoizedSupplier (org.apache.ratis.util)
initialize:177, ServerState (org.apache.ratis.server.impl)
start:338, RaftServerImpl (org.apache.ratis.server.impl)
accept:-1, RaftServerProxy$$Lambda$723/0x0000000000000000 (org.apache.ratis.server.impl)
accept:188, ConcurrentUtils (org.apache.ratis.util)
lambda$null$4:180, ConcurrentUtils (org.apache.ratis.util)
run:-1, ConcurrentUtils$$Lambda$645/0x0000000000000000 (org.apache.ratis.util)
runWorker:1149, ThreadPoolExecutor (java.util.concurrent)
run:624, ThreadPoolExecutor$Worker (java.util.concurrent)
run:823, Thread (java.lang) {code}
To prove my assumption I introduced some wait in main thread and as I see it worked fine for 10splits*4iterations in [https://github.com/raju-balpande/apache_ozone/actions/runs/6511303376]

> Intermittent failure in TestContainerStateMachineFailures#testApplyTransactionIdempotencyWithClosedContainer
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-6115
>                 URL: https://issues.apache.org/jira/browse/HDDS-6115
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: test
>            Reporter: Attila Doroszlai
>            Assignee: Raju Balpande
>            Priority: Major
>
> {code:title=https://github.com/elek/ozone-build-results/blob/master/2021/12/08/11979/it-client/hadoop-ozone/integration-test/org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures.txt}
> Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 138.65 s <<< FAILURE! - in org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures
> testApplyTransactionIdempotencyWithClosedContainer  Time elapsed: 1.13 s  <<< FAILURE!
> java.lang.AssertionError
> 	at org.junit.Assert.fail(Assert.java:87)
> 	at org.junit.Assert.assertTrue(Assert.java:42)
> 	at org.junit.Assert.assertFalse(Assert.java:65)
> 	at org.junit.Assert.assertFalse(Assert.java:75)
> 	at org.apache.hadoop.ozone.client.rpc.TestContainerStateMachineFailures.testApplyTransactionIdempotencyWithClosedContainer(TestContainerStateMachineFailures.java:465)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org