Posted to issues@ratis.apache.org by "Marton Elek (Jira)" <ji...@apache.org> on 2020/01/29 12:50:00 UTC

[jira] [Commented] (RATIS-804) Race condition between cache evict and load in LogSegment

    [ https://issues.apache.org/jira/browse/RATIS-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025848#comment-17025848 ] 

Marton Elek commented on RATIS-804:
-----------------------------------

The problematic code is this (in CacheInvalidationPolicy.java):
{code:java}
if (result.isEmpty()) {
  for (int i = safeIndex; i >= j; i--) {
    LogSegment s = segments.get(i);
    if (s.getStartIndex() > lastAppliedIndex && s.hasCache()) {
      result.add(s);
      break;
    }
  }
} {code}
This is the last step of the algorithm. The evictImpl:
 # First checks which segments are not yet flushed; those must be kept.
 # (On a follower) Checks which segments are already applied.
 # (On a follower, if no segments have been selected for eviction up to this point): *Removes the segments between the lastAppliedIndex and the localFlushIndex* in the hope that they can be reloaded at any time. They can, but only with locks.
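
To illustrate what "only with locks" means here, below is a minimal sketch, not the Ratis code. SegmentCacheSketch, readFromDisk() and the String entries are hypothetical stand-ins for LogSegment, its loader and the real log entries. The idea is that loading the cache and looking up the entry happen under the same lock that evict() takes, so an eviction cannot slip in between the load and the read:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a log segment; this is NOT the Ratis LogSegment API. */
class SegmentCacheSketch {
  private final long startIndex;
  private final int size;
  private final Object lock = new Object();
  // null means "evicted"; every access is guarded by lock
  private Map<Long, String> cache;

  SegmentCacheSketch(long startIndex, int size) {
    this.startIndex = startIndex;
    this.size = size;
  }

  /** Simulates re-reading the segment file from disk (assumption: entries are plain strings). */
  private Map<Long, String> readFromDisk() {
    Map<Long, String> loaded = new HashMap<>();
    for (long i = startIndex; i < startIndex + size; i++) {
      loaded.put(i, "entry-" + i);
    }
    return loaded;
  }

  /** Loads the cache if needed and reads the entry while still holding the lock. */
  String get(long index) {
    synchronized (lock) {
      if (cache == null) {
        cache = readFromDisk();
      }
      // evict() cannot run between the load above and this lookup
      return cache.get(index);
    }
  }

  /** Drops the cache; blocks while a load-and-read in get() is in progress. */
  void evict() {
    synchronized (lock) {
      cache = null;
    }
  }

  boolean hasCache() {
    synchronized (lock) {
      return cache != null;
    }
  }
}
```

With this shape, a get() that follows an evict() simply reloads the segment instead of failing. The cost is that eviction has to wait for an in-flight load, which may or may not be acceptable for the real LogSegment.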

> Race condition between cache evict and load in LogSegment
> ---------------------------------------------------------
>
>                 Key: RATIS-804
>                 URL: https://issues.apache.org/jira/browse/RATIS-804
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Priority: Critical
>
> I am doing some kind of stress testing with Ozone. I start one Datanode in FOLLOWER mode and the load generator (Freon) behaves like a LEADER.
> I am sending a huge number of AppendLogEntries to the FOLLOWER without any throttling.
> As a result I got NPE:
> {code:java}
> 2020-01-28 15:08:20 ERROR StateMachineUpdater:184 - 3fda0c39-ce3c-4540-a804-44d9ac1f4853@group-E1B13B4CA5C0-StateMachineUpdater: the StateMachineUpdater hits Throwable
> org.apache.ratis.server.raftlog.RaftLogIOException: java.lang.NullPointerException
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:320)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:293)
>         at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:218)
>         at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:167)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>         at java.util.Objects.requireNonNull(Objects.java:203)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:214)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:318)
>         ... 4 more {code}
> It seems to be a race condition between LogSegment.evictCache() and LogSegment.loadCache().
>  # StateMachineUpdater tries to update the StateMachine with the next log entry
>  # It can't be found in the cache, therefore the LogSegment.loadCache() is called
>  # The LogSegment.LogEntryLoader.load() reads the segment files from the disk
>  # After loading, it returns with the loaded entry
> If the gRPC thread evicts the cache between steps 3 and 4 (which is possible when the log segment is already flushed and can therefore be evicted), an NPE will be thrown.


