Posted to issues@ratis.apache.org by "Song Ziyang (Jira)" <ji...@apache.org> on 2022/11/21 15:07:00 UTC

[jira] [Resolved] (RATIS-1699) Data race condition detected in LeaderStateImpl

     [ https://issues.apache.org/jira/browse/RATIS-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Song Ziyang resolved RATIS-1699.
--------------------------------
    Resolution: Not A Problem

This error occurs rarely; it is a cascading effect of an OOM rather than an independent bug.

> Data race condition detected in LeaderStateImpl
> -----------------------------------------------
>
>                 Key: RATIS-1699
>                 URL: https://issues.apache.org/jira/browse/RATIS-1699
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Song Ziyang
>            Priority: Major
>         Attachments: error_stack.log, related_log.log
>
>
> Hi, we observed abnormal behavior of GrpcLogAppender in a recent run: it keeps *restarting*, *encountering IllegalArgumentException*, and *restarting* again. Please refer to the attachment error_stack.log for more details.
> It appears that the SegmentedRaftLog is closed, but the GrpcLogAppender is still trying to read the log, which causes the IllegalArgumentException.
> After carefully inspecting the related code, we came up with a possible event sequence that leads to this *endless restart loop.*
>  
> The LeaderStateImpl methods restart() ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532]) and stop() ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L326]) have no synchronization between them, so they can race with each other.
> Consider the following event sequence:
>  # GrpcLogAppender encounters an OOM error, transitions to the EXCEPTION state, and then calls LeaderStateImpl's restart() method to recover. ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDaemon.java#L90])
>  # The StateMachine throws an OOM, which propagates to StateMachineUpdater and causes the RaftServerImpl to close. ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java#L197])
>  # RaftServerImpl.close() in turn calls shutdownLeaderState() and closes all LogAppenders. ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L329])
>  # When step 3 returns, RaftServerImpl assumes all LogAppenders are closed, and thus calls SegmentedRaftLog.close() to safely close the Raft log. ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L416])
>  # *{color:#de350b}Here comes the data race.{color}* The restart() from step 1 keeps executing in parallel and starts a new LogAppender. This leads to a conflict: RaftServerImpl believes it has closed all LogAppenders, but one has leaked out. When the newly restarted LogAppender tries to read the RaftLog, it encounters an IllegalArgumentException, because the RaftLog has already been closed by RaftServerImpl. *It then transitions to EXCEPTION and restarts again.*
>  # Step 5 repeats over and over, producing overwhelming error logs and a system livelock.
> To confirm this, we read the logs thoroughly again. The existing logs show a similar event sequence. Please refer to related_log.log.
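The unsynchronized restart()/stop() interleaving described in the issue can be replayed deterministically in a small model. Note this is an illustrative sketch, not Ratis code: the class and method names (RaftLogModel, LeaderStateModel, restartAppender) are invented for the example, and the interleaving is hard-coded rather than left to the thread scheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Deterministic replay of the race: stop() and the in-flight restart()
// share no lock, so restart() can install a new appender AFTER the
// server has stopped everything and closed the log.
public class RestartStopRace {
    static class RaftLogModel {
        volatile boolean open = true;
        void close() { open = false; }
        void read() {
            // Stands in for SegmentedRaftLog rejecting reads after close.
            if (!open) throw new IllegalStateException("log already closed");
        }
    }

    static class LeaderStateModel {
        final List<String> appenders = new ArrayList<>();
        final RaftLogModel log;

        LeaderStateModel(RaftLogModel log) {
            this.log = log;
            appenders.add("appender-0"); // the appender that later hits OOM
        }

        // stop(): close every appender; no coordination with restart().
        void stop() { appenders.clear(); }

        // restart(): replace the failed appender with a fresh one.
        void restartAppender() {
            appenders.remove("appender-0"); // no-op if stop() already ran
            appenders.add("appender-1");    // the leaked appender
        }
    }

    public static void main(String[] args) {
        RaftLogModel log = new RaftLogModel();
        LeaderStateModel leader = new LeaderStateModel(log);

        // The interleaving from the issue, replayed step by step:
        leader.stop();            // step 3: server close stops all appenders
        log.close();              // step 4: server then closes the raft log
        leader.restartAppender(); // steps 1/5: in-flight restart() completes

        // The "closed" leader now holds one live appender...
        System.out.println("appenders after stop(): " + leader.appenders);

        // ...and that appender fails as soon as it touches the closed log.
        try {
            log.read();
        } catch (IllegalStateException e) {
            System.out.println("leaked appender hits: " + e.getMessage());
        }
    }
}
```

Guarding restart() and stop() with a shared lock (or having restart() re-check a "stopped" flag before creating the new appender) would close this window.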



--
This message was sent by Atlassian Jira
(v8.20.10#820010)