Posted to issues@ratis.apache.org by "Tsz-wo Sze (Jira)" <ji...@apache.org> on 2020/02/06 00:59:00 UTC

[jira] [Commented] (RATIS-804) Race condition between cache evict and load in LogSegment

    [ https://issues.apache.org/jira/browse/RATIS-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031161#comment-17031161 ] 

Tsz-wo Sze commented on RATIS-804:
----------------------------------

The loadCache(..) method is synchronized, but the evictCache(), clear() and truncate(..) methods are not. This looks like a bug.
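
One possible shape of a fix, as a minimal sketch only: every method that touches the cache takes the same monitor that loadCache(..) already holds. This is not the Ratis code. The class SegmentCacheSketch, the map standing in for the segment file, and the helpers append(..) and cachedEntries() are hypothetical names made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Simplified, hypothetical stand-in for LogSegment; not the Ratis implementation. */
class SegmentCacheSketch {
  // Stand-in for the on-disk segment file: log index -> entry payload.
  private final Map<Long, String> disk = new HashMap<>();
  // In-memory entry cache; may be dropped once the segment has been flushed.
  private final Map<Long, String> cache = new HashMap<>();

  synchronized void append(long index, String entry) {
    disk.put(index, entry);
    cache.put(index, entry);
  }

  /** Reloads the cache from "disk" and returns the requested entry. */
  synchronized String loadCache(long index) {
    cache.putAll(disk);
    return Objects.requireNonNull(cache.get(index), "entry " + index + " not found");
  }

  /** Same monitor as loadCache, so it cannot run between the reload and the lookup. */
  synchronized void evictCache() {
    cache.clear();
  }

  /** Drops everything, including the "disk" content. */
  synchronized void clear() {
    disk.clear();
    cache.clear();
  }

  /** Removes all entries with an index greater than or equal to fromIndex. */
  synchronized void truncate(long fromIndex) {
    disk.keySet().removeIf(i -> i >= fromIndex);
    cache.keySet().removeIf(i -> i >= fromIndex);
  }

  synchronized int cachedEntries() {
    return cache.size();
  }
}
```

With all of these methods synchronized on the same object, an eviction either completes before loadCache(..) starts or waits until it has returned. Whether coarse locking like this is acceptable for the real LogSegment is a separate question.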

> Race condition between cache evict and load in LogSegment
> ---------------------------------------------------------
>
>                 Key: RATIS-804
>                 URL: https://issues.apache.org/jira/browse/RATIS-804
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Priority: Critical
>
> I am stress testing Ozone. I start one Datanode in FOLLOWER mode, and the load generator (Freon) behaves like a LEADER.
> I am sending a huge number of AppendLogEntries to the FOLLOWER without any limit.
> As a result, I got an NPE:
> {code:java}
> 2020-01-28 15:08:20 ERROR StateMachineUpdater:184 - 3fda0c39-ce3c-4540-a804-44d9ac1f4853@group-E1B13B4CA5C0-StateMachineUpdater: the StateMachineUpdater hits Throwable
> org.apache.ratis.server.raftlog.RaftLogIOException: java.lang.NullPointerException
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:320)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:293)
>         at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:218)
>         at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:167)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>         at java.util.Objects.requireNonNull(Objects.java:203)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:214)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:318)
>         ... 4 more
> {code}
> It seems to be a race condition between LogSegment.evictCache() and LogSegment.loadCache().
>  # StateMachineUpdater tries to update the StateMachine with the next log entry
>  # The entry is not found in the cache, so LogSegment.loadCache() is called
>  # LogSegment.LogEntryLoader.load() reads the segment file from disk
>  # After loading, it returns the loaded entry
> If the gRPC thread evicts the cache between steps 3 and 4 (which is possible, because a log segment that has already been flushed can be evicted), an NPE is thrown.
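
The interleaving described in the quoted report can be reproduced deterministically in a simplified model. The sketch below is not the Ratis code. It assumes that the loader first fills the cache from disk and then looks the entry up in that cache with Objects.requireNonNull, which is what the stack trace suggests. The class EvictLoadRaceSketch, the betweenSteps hook, and the helper reproduces() are hypothetical and exist only to force the eviction to happen between steps 3 and 4.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the reported race; not the Ratis implementation. */
class EvictLoadRaceSketch {
  // Stand-in for the on-disk segment file: log index -> entry payload.
  private final Map<Long, String> disk = new ConcurrentHashMap<>();
  // In-memory entry cache.
  private final Map<Long, String> cache = new ConcurrentHashMap<>();

  void append(long index, String entry) {
    disk.put(index, entry);
    cache.put(index, entry);
  }

  // Not synchronized, mirroring the situation described in the report.
  void evictCache() {
    cache.clear();
  }

  // Synchronized, but that does not help while evictCache() takes no lock.
  synchronized String loadCache(long index, Runnable betweenSteps) {
    cache.putAll(disk);   // step 3: read the segment from "disk" into the cache
    betweenSteps.run();   // test hook: another thread gets scheduled here
    // step 4: return the loaded entry, which is expected to be in the cache
    return Objects.requireNonNull(cache.get(index));
  }

  /** Returns true if evicting between steps 3 and 4 produces an NPE. */
  static boolean reproduces() {
    EvictLoadRaceSketch segment = new EvictLoadRaceSketch();
    segment.append(0L, "entry-0");
    segment.evictCache(); // entry is flushed and evicted, so a load is needed

    // Runs the eviction on a second thread while the loader holds its monitor.
    Runnable evictOnOtherThread = () -> {
      Thread t = new Thread(segment::evictCache, "grpc-thread-stand-in");
      t.start();
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };

    try {
      segment.loadCache(0L, evictOnOtherThread);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("NPE reproduced: " + reproduces());
  }
}
```

The second thread is never blocked because evictCache() does not take the monitor held by loadCache(..). In this model that is the whole reason the window between steps 3 and 4 is open.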


