Posted to dev@zookeeper.apache.org by "Patrick Hunt (JIRA)" <ji...@apache.org> on 2012/06/24 06:38:43 UTC

[jira] [Comment Edited] (ZOOKEEPER-1453) corrupted logs may not be correctly identified by FileTxnIterator

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399853#comment-13399853 ] 

Patrick Hunt edited comment on ZOOKEEPER-1453 at 6/24/12 4:37 AM:
------------------------------------------------------------------

Patrick, thanks for taking the time to explain. I read through the other bug and your explanation is very clear. I'd like to work on a fix for this, as it's hitting us very frequently in a stress test where we continually reboot one of the machines hosting one of our zk servers. Looking at the FileTxnIterator code, I definitely see the bug in the next() method: it always assumes EOF is success. Have you given thought to the right solution here? Maybe give precedence to validating the CRC before checking for EOF?

What do you think about this:

{noformat}
public boolean next() throws IOException {
    if (ia == null) {
        return false;
    }
    try {
        long crcValue = ia.readLong("crcvalue");
        byte[] bytes = Util.readTxnBytes(ia);
        // validate CRC
        Checksum crc = makeChecksumAlgorithm();
        if (bytes != null) {
            crc.update(bytes, 0, bytes.length);
        }
        if (crcValue != crc.getValue())
            throw new IOException(CRC_ERROR);
        if (bytes == null || bytes.length == 0)
            throw new EOFException("Failed to read " + logFile);
        hdr = new TxnHeader();
        record = SerializeUtils.deserializeTxn(bytes, hdr);
    } catch (EOFException e) {
    ...
{noformat}
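
To make the ordering concrete outside of the ZooKeeper code base, here is a minimal, self-contained sketch. It is not the ZooKeeper implementation: the record layout ([long crc][int length][payload]), the class name TxnRecordSketch, the Status enum, and the encode/classify helpers are all assumptions made up for illustration. The point it tries to show is that EOF only counts as a clean end when it falls on a record boundary; EOF in the middle of a record, or a checksum mismatch, is reported as corruption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.Adler32;
import java.util.zip.Checksum;

// Hypothetical layout for illustration only: [long crc][int length][payload].
// This is NOT the ZooKeeper on-disk format or API.
class TxnRecordSketch {
    enum Status { RECORD, END_OF_LOG, CORRUPT }

    // Adler32 checksum over the payload bytes.
    static long checksum(byte[] payload) {
        Checksum crc = new Adler32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // Serialize one record in the hypothetical layout.
    static byte[] encode(byte[] payload) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            data.writeLong(checksum(payload));
            data.writeInt(payload.length);
            data.write(payload);
            data.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Classify a buffer holding at most one record.
    static Status classify(byte[] log) {
        if (log.length == 0) {
            return Status.END_OF_LOG;          // EOF exactly at a record boundary
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(log))) {
            long expected = in.readLong();     // EOFException if the header is cut short
            int length = in.readInt();
            if (length < 0) {
                return Status.CORRUPT;         // a negative length cannot be valid
            }
            byte[] payload = new byte[length];
            in.readFully(payload);             // EOFException if the payload is cut short
            // The CRC decides validity once a whole record has been read.
            return expected == checksum(payload) ? Status.RECORD : Status.CORRUPT;
        } catch (EOFException e) {
            return Status.CORRUPT;             // EOF in the middle of a record
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] good = encode(new byte[] {1, 2, 3, 4});
        byte[] truncated = Arrays.copyOf(good, good.length - 2);
        System.out.println(classify(good));
        System.out.println(classify(truncated));
        System.out.println(classify(new byte[0]));
    }
}
```

In this sketch a truncated record is classified as CORRUPT rather than END_OF_LOG, which is the distinction the current next() does not make. A real fix would also have to account for preallocated (zero-padded) log files, which this sketch deliberately ignores.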

                
      was (Author: marshall): the same comment text and code as above, without the {noformat} markers around the code.

> corrupted logs may not be correctly identified by FileTxnIterator
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1453
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1453
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Patrick Hunt
>            Priority: Critical
>
> See ZOOKEEPER-1449 for background on this issue. The main problem is that during server recovery org.apache.zookeeper.server.persistence.FileTxnLog.FileTxnIterator.next() does not indicate whether the available logs are valid. In some cases (say a truncated record and a single txnlog in the datadir) we cannot tell a corrupt file apart from one where we have simply reached the end of the file.
