You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ratis.apache.org by "Lokesh Jain (Jira)" <ji...@apache.org> on 2020/02/14 13:32:00 UTC

[jira] [Comment Edited] (RATIS-815) Log entry corrupted with 0 checksum

    [ https://issues.apache.org/jira/browse/RATIS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036965#comment-17036965 ] 

Lokesh Jain edited comment on RATIS-815 at 2/14/20 1:31 PM:
------------------------------------------------------------

Analysis so far.
 # On avoiding checksum exception the parsing of log entry fails with below exception.
{code:java}
org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero). at org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:102) at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readTag(CodedInputStream.java:627) at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto.<init>(RaftProtos.java:5395) at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto.<init>(RaftProtos.java:5356) at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$1.parsePartialFrom(RaftProtos.java:6235) at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$1.parsePartialFrom(RaftProtos.java:6229) at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:888) at org.apache.ratis.proto.RaftProtos$LogEntryProto.<init>(RaftProtos.java:6866) at org.apache.ratis.proto.RaftProtos$LogEntryProto.<init>(RaftProtos.java:6808) at org.apache.ratis.proto.RaftProtos$LogEntryProto$1.parsePartialFrom(RaftProtos.java:8018) at org.apache.ratis.proto.RaftProtos$LogEntryProto$1.parsePartialFrom(RaftProtos.java:8012) at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86) at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:91) at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48) at org.apache.ratis.thirdparty.com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:365) at org.apache.ratis.proto.RaftProtos$LogEntryProto.parseFrom(RaftProtos.java:7273) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:321) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129) at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98) at org.apache.ratis.tools.ParseRatisLog.dumpSegmentFile(ParseRatisLog.java:59) at org.apache.ratis.tools.DefaultLogDump.main(DefaultLogDump.java:40){code}

 # On avoiding both checksum check and parsing of log entry the subsequent entries can be parsed. But there are multiple entries in the same log segment file which seem to have the corruption problem.
 # Tried to debug the code as to where parsing of the entry was failing. All the entries seen so far are of type StateMachineLogEntryProto. The parsing fails after reading the logData field in StateMachineLogEntryProto. It is able to correctly read term and index fields.

Possible reasons of corruption which are being explored.
 # Corruption during flush of log.
 # Error related to StateMachineEntry which is part of StateMachineLogEntryProto. This proto contains the stateMachineData.
 # Corruption related to change in RATIS-767. We reuse a byte buffer in this Jira for log write.


was (Author: ljain):
Analysis so far.
 # On avoiding checksum exception the parsing of log entry fails with below exception.
{code:java}
org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero). at org.apache.ratis.thirdparty.com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:102) at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readTag(CodedInputStream.java:627) at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto.<init>(RaftProtos.java:5395) at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto.<init>(RaftProtos.java:5356) at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$1.parsePartialFrom(RaftProtos.java:6235) at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$1.parsePartialFrom(RaftProtos.java:6229) at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:888) at org.apache.ratis.proto.RaftProtos$LogEntryProto.<init>(RaftProtos.java:6866) at org.apache.ratis.proto.RaftProtos$LogEntryProto.<init>(RaftProtos.java:6808) at org.apache.ratis.proto.RaftProtos$LogEntryProto$1.parsePartialFrom(RaftProtos.java:8018) at org.apache.ratis.proto.RaftProtos$LogEntryProto$1.parsePartialFrom(RaftProtos.java:8012) at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86) at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:91) at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48) at org.apache.ratis.thirdparty.com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:365) at org.apache.ratis.proto.RaftProtos$LogEntryProto.parseFrom(RaftProtos.java:7273) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:321) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129) at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98) at org.apache.ratis.tools.ParseRatisLog.dumpSegmentFile(ParseRatisLog.java:59) at org.apache.ratis.tools.DefaultLogDump.main(DefaultLogDump.java:40){code}

 # On avoiding both checksum check and parsing of log entry the subsequent entries can be parsed. But there are multiple entries in the same log segment file which seem to have the corruption problem.
 # Tried to debug the code as to where parsing of the entry was failing. All the entries seen so far are of type StateMachineLogEntryProto. The parsing fails after reading the logData field in StateMachineLogEntryProto.

Possible reasons of corruption which are being explored.
 # Corruption during flush of log.
 # Error related to StateMachineEntry which is part of StateMachineLogEntryProto. This proto contains the stateMachineData.
 # Corruption related to change in RATIS-767. We reuse a byte buffer in this Jira for log write.

> Log entry corrupted with 0 checksum
> -----------------------------------
>
>                 Key: RATIS-815
>                 URL: https://issues.apache.org/jira/browse/RATIS-815
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Attila Doroszlai
>            Priority: Blocker
>         Attachments: dumps.tar.gz, logs.tar.gz
>
>
> After writing a few large keys (128MB) with very small chunks size (64KB) in Ozone, Ratis reports log entry corruption due to checksum error:
> {code}
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:396 - e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker: Rolling segment log-62379_62465 to index:62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:541 - e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker: Rolled log segment from /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62379 to /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_62379-62465
> 2020-02-13 12:01:41 INFO  SegmentedRaftLogWorker:583 - e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236-SegmentedRaftLogWorker: created new log segment /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_inprogress_62466
> 2020-02-13 12:01:41 ERROR LogAppender:81 - e5e4fd1e-aa81-48a2-98f9-b1ba24531624@group-B85226EEE236->ac5b3434-874b-4375-8a03-989e8c7fb692-GrpcLogAppender-AppenderDaemon failed RaftLog
> org.apache.ratis.server.raftlog.RaftLogIOException: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated checksum is CDFED097 but read checksum is 00000000.
> 	at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:311)
> 	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:292)
> 	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:297)
> 	at org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:213)
> 	at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:179)
> 	at org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:122)
> 	at org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:77)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated checksum is CDFED097 but read checksum is 00000000.
> 	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.decodeEntry(SegmentedRaftLogReader.java:312)
> 	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogReader.readEntry(SegmentedRaftLogReader.java:194)
> 	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogInputStream.nextEntry(SegmentedRaftLogInputStream.java:129)
> 	at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:98)
> 	at org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:202)
> 	at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:309)
> 	... 7 more
> {code}
> Steps to reproduce:
> 1. Configure Ozone with 64KB chunk size and slightly higher buffer sizes:
>     {code}
> ozone.scm.chunk.size: 64KB
> ozone.client.stream.buffer.flush.size: 256KB
> ozone.client.stream.buffer.max.size: 1MB
> {code}
> 2. Run Freon:
>     {code}
> ozone freon ockg -n 1 -t 1 -p warmup
> ozone freon ockg -p test -t 8 -s 134217728 -n 32
> {code}
> Interestingly, even {{log_5106-5509}} has invalid entry (according to log dump utility):
> {code}
> Processing Raft Log file: /data/metadata/ratis/f89fc072-9ee9-459b-85d1-b85226eee236/current/log_5106-5509 size:1030796
> ...
> (t:1, i:5161), STATEMACHINELOGENTRY, client-296B6A48E40D, cid=3307
> Exception in thread "main" org.apache.ratis.protocol.ChecksumException: Log entry corrupted: Calculated checksum is 926127AE but read checksum is 00000000.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)