You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ratis.apache.org by "Lokesh Jain (Jira)" <ji...@apache.org> on 2019/12/16 13:22:00 UTC

[jira] [Comment Edited] (RATIS-762) Changing server's log corruption policy can lead to assertion error

    [ https://issues.apache.org/jira/browse/RATIS-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997055#comment-16997055 ] 

Lokesh Jain edited comment on RATIS-762 at 12/16/19 1:21 PM:
-------------------------------------------------------------

[~szetszwo] Thanks for working on the issue! [~pifta] Thanks for filing the Jira and the reviews! I have committed the patch to master branch.


was (Author: ljain):
[~szetszwo] Thanks for working on the issue! [~pifta] Thanks for filing the jira! I have committed the patch to master branch.

> Changing server's log corruption policy can lead to assertion error
> -------------------------------------------------------------------
>
>                 Key: RATIS-762
>                 URL: https://issues.apache.org/jira/browse/RATIS-762
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>            Reporter: Istvan Fajth
>            Assignee: Tsz-wo Sze
>            Priority: Major
>             Fix For: 0.5.0
>
>         Attachments: r762_20191203.patch, r762_20191213.patch
>
>
> After restarting the cluster I have ran into the exception discussed and partly fixed by RATIS-677.
> The exception:
> {code}
> ERROR org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: Unable to communicate to SCM server at <scm_host>:9861 for past X seconds.
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated checksum is Y but read checksum is 00000000.
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:134)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221)
>         at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
>         at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191)
>         at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121)
>         at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113)
>         at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> The code contains the new option about CorruptionPolicy, however HDDS DataNode does not set this property for the initialised Raft server I will create a separate JIRA for that.
> After I modified the code, and specified the {{raft.server.log.corruption.policy}} as {{WARN_AND_RETURN}}, the code properly printed a warning as:
> {code}
> WARN org.apache.ratis.server.raftlog.segmented.LogSegment: Failed to read segment file /data/3/hadoop-ozone/datanode/ratis/data/9282c3f1-5efa-48e2-8381-af1a20ce871d/current/log_48957-49189 (start=48957, end=49189, isOpen? false): only 127 entries read successfully
> org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated checksum is Y but read checksum is 00000000.
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:134)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221)
>         at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
>         at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191)
>         at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121)
>         at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113)
>         at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> But about 1.6 seconds later the system throws the following exception, writing the following log lines in between the warning and the exception:
> {code}
> INFO org.apache.ratis.server.raftlog.segmented.LogSegment: Successfully read 127 entries from segment file /data/3/hadoop-ozone/datanode/ratis/data/9282c3f1-5efa-48e2-8381-af1a20ce871d/current/log_48957-49189
> INFO org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer: Attempting to start container services.
> INFO org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> INFO org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: Starting XceiverServerRatis ee1c9457-fb78-4a58-8b8d-1a39abee88ff at port 9858
> ERROR org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: Unable to communicate to SCM server at <scm_host>:9861 for past X seconds.
> java.io.IOException: java.lang.IllegalStateException
>         at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>         at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>         at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>         at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>         at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:422)
>         at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:215)
>         at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:110)
>         at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalStateException
>         at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:36)
>         at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:156)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:318)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:252)
>         at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:221)
>         at org.apache.ratis.server.raftlog.RaftLog.open(RaftLog.java:247)
>         at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:191)
>         at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:121)
>         at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:113)
>         at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         ... 3 more
> {code}
> So changing the policy effectively just pushes the issue further down in the LogSegment.load code, where there is an assertion that we were able to read until the end, which is not true if there is a checksum error in the meantime.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)