Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2011/03/19 00:51:29 UTC

[jira] Created: (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart
----------------------------------------------------------------------------------------------------------

                 Key: HBASE-3674
                 URL: https://issues.apache.org/jira/browse/HBASE-3674
             Project: HBase
          Issue Type: Bug
          Components: wal
            Reporter: stack
            Priority: Critical
             Fix For: 0.90.2


In short, a ChecksumException will fail log processing for a server, so we skip out without archiving the logs.  On restart we reprocess the logs -- usually hitting the ChecksumException anew -- and so on.

Here is the splitLog method (edited):

{code}
  private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
    ....
    outputSink.startWriterThreads(entryBuffers);
    
    try {
      int i = 0;
      for (FileStatus log : logfiles) {
       Path logPath = log.getPath();
        long logLength = log.getLen();
        splitSize += logLength;
        LOG.debug("Splitting hlog " + (i++ + 1) + " of " + logfiles.length
            + ": " + logPath + ", length=" + logLength);
        try {
          recoverFileLease(fs, logPath, conf);
          parseHLog(log, entryBuffers, fs, conf);
          processedLogs.add(logPath);
        } catch (EOFException eof) {
          // truncated files are expected if a RS crashes (see HBASE-2643)
          LOG.info("EOF from hlog " + logPath + ". Continuing");
          processedLogs.add(logPath);
        } catch (FileNotFoundException fnfe) {
          // A file may be missing if the region server was able to archive it
          // before shutting down. This means the edits were persisted already
          LOG.info("A log was missing " + logPath +
              ", probably because it was moved by the" +
              " now dead region server. Continuing");
          processedLogs.add(logPath);
        } catch (IOException e) {
          // If the IOE resulted from bad file format,
          // then this problem is idempotent and retrying won't help
          if (e.getCause() instanceof ParseException ||
              e.getCause() instanceof ChecksumException) {
            LOG.warn("ParseException from hlog " + logPath + ".  continuing");
            processedLogs.add(logPath);
          } else {
            if (skipErrors) {
              LOG.info("Got while parsing hlog " + logPath +
                ". Marking as corrupted", e);
              corruptedLogs.add(logPath);
            } else {
              throw e;
            }
          }
        }
      }
      if (fs.listStatus(srcDir).length > processedLogs.size()
          + corruptedLogs.size()) {
        throw new OrphanHLogAfterSplitException(
            "Discovered orphan hlog after split. Maybe the "
            + "HRegionServer was not dead when we started");
      }
      archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);      
    } finally {
      splits = outputSink.finishWritingAndClose();
    }
    return splits;
  }
{code}

Notice how we'll only archive logs if we successfully split all of them.  We won't archive 31 of 35 files if we happen to get a checksum exception on file 32.

I think we should treat a ChecksumException the same as a ParseException; a retry will not fix it if HDFS could not get around the ChecksumException (it seems that in our case all replicas were corrupt).

Here is a play-by-play from the logs:

{code}
813572 2011-03-18 20:31:44,687 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 34 of 35: hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481, length=15065662
813573 2011-03-18 20:31:44,687 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
....
813617 2011-03-18 20:31:46,238 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[0, 512]=000000cd000000502037383661376439656265643938636463343433386132343631323633303239371d6170695f6163636573735f746f6b656e5f7374       6174735f6275636b65740000000d9fa4d5dc0000012ec9c7cbaf00ffffffff000000010000006d0000005d00000008002337626262663764626431616561366234616130656334383436653732333132643a32390764656661756c746170695f616e64726f69645f6c6f67676564       696e5f73686172655f70656e64696e675f696e69740000012ec956b02804000000000000000100000000ffffffff4e128eca0eb078d0652b0abac467fd09000000cd000000502034663166613763666165333930666332653138346233393931303132623366331d6170695f6163       636573735f746f6b656e5f73746174735f6275636b65740000000d9fa4d5dd0000012ec9c7cbaf00ffffffff000000010000006d0000005d00000008002366303734323966643036323862636530336238333938356239316237386633353a32390764656661756c746170695f61       6e64726f69645f6c6f67676564696e5f73686172655f70656e64696e675f696e69740000012ec9569f1804000000000000000100000000000000d30000004e2066663763393964303633343339666531666461633761616632613964643631331b6170695f6163636573735f746f       6b656e5f73746174735f68
813618 org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_7781725413191608261:of:/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481 at 15064576
813619         at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
813620         at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
813621         at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
813622         at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
813623         at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
813624         at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1175)
813625         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1807)
813626         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1859)
813627         at java.io.DataInputStream.read(DataInputStream.java:132)
813628         at java.io.DataInputStream.readFully(DataInputStream.java:178)
813629         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
813630         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
813631         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
813632         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
813633         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
813634         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
813635         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
813636         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
813637         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
813638         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
813639         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:197)
813640         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:181)
813641         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:384)
813642         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
813643 2011-03-18 20:31:46,239 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_7781725413191608261_14589573 from 10.20.20.182:50010 at 15064576
813644 2011-03-18 20:31:46,240 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_7781725413191608261_14589573 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
813645 2011-03-18 20:31:49,243 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=80624 entries from hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
....
{code}

See the code above.  On exception we dump the edits read so far from this block and close out all writers, tying off the recovered.edits written so far.  We then skip archiving these files because we only archive when all files have been processed; we won't archive files 30 of 35 if we failed splitting on file 31.

I think a ChecksumException should be treated the same as a ParseException.

  



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3674:
-------------------------

      Resolution: Fixed
    Release Note: This commit changes "hbase.hlog.split.skip.errors" from false to true.  IMPORTANT DIFFERENCE! We're going to just archive corrupted files and skip over them rather than stop forward motion.
          Status: Resolved  (was: Patch Available)

Committed to branch and trunk.  Thanks for the review, Nicolas.
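
Operators who want the old fail-fast behaviour back can override the flag.  Here is a minimal sketch using the standard HBaseConfiguration API; the property name is the one from the release note, everything else is just illustration (the same effect comes from setting the property in hbase-site.xml on the master, which is where the split in the stack trace above runs at startup):

{code}
// Illustrative sketch only, not part of the patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SplitSkipErrorsSetting {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // With this change the shipped default is true: corrupted logs are
    // archived and splitting keeps going.  Set false to fail the split
    // on the first unexpected IOException instead.
    conf.setBoolean("hbase.hlog.split.skip.errors", false);
    System.out.println("hbase.hlog.split.skip.errors = "
        + conf.getBoolean("hbase.hlog.split.skip.errors", true));
  }
}
{code}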


[jira] [Updated] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3674:
-------------------------

    Attachment: 3674-distributed.txt

How about this, Prakash?  This adds back the changes and sets the default to true for the fail flag.


[jira] Updated: (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3674:
-------------------------

    Attachment: 3674.txt

Here is a first cut at a patch.


[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009963#comment-13009963 ] 

Hudson commented on HBASE-3674:
-------------------------------

Integrated in HBase-TRUNK #1803 (See [https://hudson.apache.org/hudson/job/HBase-TRUNK/1803/])
    


[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "Prakash Khemani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022873#comment-13022873 ] 

Prakash Khemani commented on HBASE-3674:
----------------------------------------

This change got overwritten when HBASE-1364 was integrated.

The change has to be added in HLogSplitter, in the getNextLogLine method:

{code}
static private Entry getNextLogLine(Reader in, Path path, boolean skipErrors)
  throws CorruptedLogFileException, IOException {
    try {
      return in.next();
    } catch (EOFException eof) {
      // truncated files are expected if a RS crashes (see HBASE-2643)
      LOG.info("EOF from hlog " + path + ".  continuing");
      return null;
    } catch (IOException e) {
      // If the IOE resulted from bad file format,
      // then this problem is idempotent and retrying won't help
      if (e.getCause() instanceof ParseException) {
        LOG.warn("ParseException from hlog " + path + ".  continuing");
        return null;
      }
      ....
{code}
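
To sketch what that check could look like once ChecksumException is folded in (this is not the committed patch; the class and method names below are hypothetical, and I am assuming the ParseException in question is java.text.ParseException):

{code}
// Hedged sketch: widen the "idempotent failure" test to cover checksum
// errors as well as parse errors.  Re-reading the same bytes fails the same
// way in both cases, so retrying cannot help.
import java.io.IOException;
import java.text.ParseException;

import org.apache.hadoop.fs.ChecksumException;

final class WalSplitErrorPolicy {
  private WalSplitErrorPolicy() {}

  /** True if the IOException came from a bad record format or a checksum
   *  mismatch; the splitter should skip the entry or mark the file corrupted
   *  rather than retry. */
  static boolean isIdempotentReadFailure(IOException e) {
    return e.getCause() instanceof ParseException
        || e.getCause() instanceof ChecksumException;
  }
}
{code}

The same kind of predicate could then be consulted from getNextLogLine and, as noted below, from getReader(...).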


It might also be necessary to add this change to the getReader(...) method:

{code}
  protected Reader getReader(FileSystem fs, FileStatus file, Configuration conf,
      boolean skipErrors)
      throws IOException, CorruptedLogFileException {
    Path path = file.getPath();
    long length = file.getLen();
    Reader in;

    // Check for possibly empty file. With appends, currently Hadoop reports a
    // zero length even if the file has been sync'd. Revisit if HDFS-376 or
    // HDFS-878 is committed.
    if (length <= 0) {
      LOG.warn("File " + path + " might be still open, length is 0");
    }

    try {
      recoverFileLease(fs, path, conf);
      try {
        in = getReader(fs, path, conf);
      } catch (EOFException e) {
        if (length <= 0) {
          // TODO should we ignore an empty, not-last log file if skip.errors
          // is false? Either way, the caller should decide what to do. E.g.
          // ignore if this is the last log in sequence.
          // TODO is this scenario still possible if the log has been
          // recovered (i.e. closed)
          LOG.warn("Could not open " + path + " for reading. File is empty", e);
          return null;
          ....
{code}




> Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3674
>                 URL: https://issues.apache.org/jira/browse/HBASE-3674
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.90.2
>
>         Attachments: 3674-v2.txt, 3674.txt
>
>
> In short, a ChecksumException will fail log processing for a server so we skip out w/o archiving logs.  On restart, we'll then reprocess the logs -- hit the checksumexception anew, usually -- and so on.
> Here is the splitLog method (edited):
> {code}
>   private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
>     ....
>     outputSink.startWriterThreads(entryBuffers);
>     
>     try {
>       int i = 0;
>       for (FileStatus log : logfiles) {
>        Path logPath = log.getPath();
>         long logLength = log.getLen();
>         splitSize += logLength;
>         LOG.debug("Splitting hlog " + (i++ + 1) + " of " + logfiles.length
>             + ": " + logPath + ", length=" + logLength);
>         try {
>           recoverFileLease(fs, logPath, conf);
>           parseHLog(log, entryBuffers, fs, conf);
>           processedLogs.add(logPath);
>         } catch (EOFException eof) {
>           // truncated files are expected if a RS crashes (see HBASE-2643)
>           LOG.info("EOF from hlog " + logPath + ". Continuing");
>           processedLogs.add(logPath);
>         } catch (FileNotFoundException fnfe) {
>           // A file may be missing if the region server was able to archive it
>           // before shutting down. This means the edits were persisted already
>           LOG.info("A log was missing " + logPath +
>               ", probably because it was moved by the" +
>               " now dead region server. Continuing");
>           processedLogs.add(logPath);
>         } catch (IOException e) {
>           // If the IOE resulted from bad file format,
>           // then this problem is idempotent and retrying won't help
>           if (e.getCause() instanceof ParseException ||
>               e.getCause() instanceof ChecksumException) {
>             LOG.warn("ParseException from hlog " + logPath + ".  continuing");
>             processedLogs.add(logPath);
>           } else {
>             if (skipErrors) {
>               LOG.info("Got while parsing hlog " + logPath +
>                 ". Marking as corrupted", e);
>               corruptedLogs.add(logPath);
>             } else {
>               throw e;
>             }
>           }
>         }
>       }
>       if (fs.listStatus(srcDir).length > processedLogs.size()
>           + corruptedLogs.size()) {
>         throw new OrphanHLogAfterSplitException(
>             "Discovered orphan hlog after split. Maybe the "
>             + "HRegionServer was not dead when we started");
>       }
>       archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);      
>     } finally {
>       splits = outputSink.finishWritingAndClose();
>     }
>     return splits;
>   }
> {code}
> Notice how we'll only archive logs only if we successfully split all logs.  We won't archive 31 of 35 files if we happen to get a checksum exception on file 32.
> I think we should treat a ChecksumException the same as a ParseException; a retry will not fix it if HDFS could not get around the ChecksumException (seems like in our case all replicas were corrupt).
> Here is a play-by-play from the logs:
> {code}
> 813572 2011-03-18 20:31:44,687 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 34 of 35: hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481, length=150       65662813573 2011-03-18 20:31:44,687 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
> ....
> 813617 2011-03-18 20:31:46,238 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[0, 512]=000000cd000000502037383661376439656265643938636463343433386132343631323633303239371d6170695f6163636573735f746f6b656e5f7374       6174735f6275636b65740000000d9fa4d5dc0000012ec9c7cbaf00ffffffff000000010000006d0000005d00000008002337626262663764626431616561366234616130656334383436653732333132643a32390764656661756c746170695f616e64726f69645f6c6f67676564       696e5f73686172655f70656e64696e675f696e69740000012ec956b02804000000000000000100000000ffffffff4e128eca0eb078d0652b0abac467fd09000000cd000000502034663166613763666165333930666332653138346233393931303132623366331d6170695f6163       636573735f746f6b656e5f73746174735f6275636b65740000000d9fa4d5dd0000012ec9c7cbaf00ffffffff000000010000006d0000005d00000008002366303734323966643036323862636530336238333938356239316237386633353a32390764656661756c746170695f61       6e64726f69645f6c6f67676564696e5f73686172655f70656e64696e675f696e69740000012ec9569f1804000000000000000100000000000000d30000004e2066663763393964303633343339666531666461633761616632613964643631331b6170695f6163636573735f746f       6b656e5f73746174735f68
> 813618 org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_7781725413191608261:of:/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481 at 15064576
> 813619         at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
> 813620         at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
> 813621         at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
> 813622         at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
> 813623         at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
> 813624         at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1175)
> 813625         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1807)
> 813626         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1859)
> 813627         at java.io.DataInputStream.read(DataInputStream.java:132)
> 813628         at java.io.DataInputStream.readFully(DataInputStream.java:178)
> 813629         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
> 813630         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
> 813631         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
> 813632         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
> 813633         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
> 813634         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
> 813635         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
> 813636         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
> 813637         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
> 813638         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
> 813639         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:197)
> 813640         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:181)
> 813641         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:384)
> 813642         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
> 813643 2011-03-18 20:31:46,239 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_7781725413191608261_14589573 from 10.20.20.182:50010 at 15064576
> 813644 2011-03-18 20:31:46,240 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_7781725413191608261_14589573 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 813645 2011-03-18 20:31:49,243 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=80624 entries from hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
> ....
> {code}
> See the code above.  On the exception we'll dump the edits read so far from this block and close out all writers, tying off the recovered.edits written so far.  We'll skip archiving these files because we only archive if all files are processed; we won't archive files 30 of 35 if we failed splitting on file 31.
> I think a ChecksumException should be treated the same as a ParseException.
>   
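To restate the corruption check from the snippet above as a standalone piece: the sketch below is illustrative only (the helper name is made up and is not part of the attached patch).  It assumes the ChecksumException can show up either directly -- it is itself an IOException -- or wrapped as the cause of the IOException rethrown by SequenceFileLogReader.

{code}
import java.io.IOException;
import java.text.ParseException;

import org.apache.hadoop.fs.ChecksumException;

// Illustrative helper only (not from the patch): decide whether an IOException
// seen while reading an hlog is unrecoverable corruption, i.e. retrying the
// split will not help and the file should be marked processed rather than
// causing the whole split to bail out.
public final class CorruptionCheck {
  private CorruptionCheck() {}

  static boolean isUnrecoverable(final IOException e) {
    return e instanceof ChecksumException
        || e.getCause() instanceof ParseException
        || e.getCause() instanceof ChecksumException;
  }
}
{code}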

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3674:
-------------------------

    Assignee: stack
      Status: Patch Available  (was: Open)

Submitting patch, needs a test though.
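If it helps whoever writes that test, here is a self-contained sketch of the failure mode to reproduce: corrupt a byte under a checksummed FileSystem and confirm the reader surfaces a ChecksumException.  This is plain Java against hadoop-common only, not an HBase unit test; the path and sizes are arbitrary.

{code}
import java.io.RandomAccessFile;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: reproduce the ChecksumException the splitter hits, outside HBase.
public class ChecksumCorruptionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);   // LocalFileSystem checksums its data
    Path p = new Path("/tmp/corrupt-me");

    FSDataOutputStream out = fs.create(p, true);
    for (long i = 0; i < 100000; i++) {
      out.writeLong(i);
    }
    out.close();

    // Flip one byte in the raw file, leaving the .crc sidecar stale.
    RandomAccessFile raw = new RandomAccessFile("/tmp/corrupt-me", "rw");
    raw.seek(4096);
    raw.writeByte(0xff);
    raw.close();

    byte[] buf = new byte[64 * 1024];
    FSDataInputStream in = fs.open(p);
    try {
      while (in.read(buf) > 0) {
        // keep reading until we cross the corrupted checksum chunk
      }
      System.out.println("no corruption detected?");
    } catch (ChecksumException expected) {
      System.out.println("got expected " + expected);
    } finally {
      in.close();
    }
  }
}
{code}

The real test would of course drive HLogSplitter over a corrupted hlog and assert the file still ends up archived.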


[jira] [Updated] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3674:
-------------------------

    Attachment: 3674-v2.txt


[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025366#comment-13025366 ] 

stack commented on HBASE-3674:
------------------------------

Committed to TRUNK.  Thanks for the review, Prakash.


[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13023042#comment-13023042 ] 

Todd Lipcon commented on HBASE-3674:
------------------------------------

I actually missed when this went in. Having the default be to skip errors seems really really dangerous.... especially as a change in a release branch.


[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027259#comment-13027259 ] 

Hudson commented on HBASE-3674:
-------------------------------

Integrated in HBase-TRUNK #1888 (See [https://builds.apache.org/hudson/job/HBase-TRUNK/1888/])
    


[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025008#comment-13025008 ] 

stack commented on HBASE-3674:
------------------------------

bq. I actually missed when this went in. Having the default be to skip errors seems really really dangerous.... especially as a change in a release branch.

The default was true but we would keep going anyway; it was incorrectly implemented.  I opened HBASE-3675 at the time.  I suggest we continue the discussion about how dangerous changing the config is/was over there.  Here I'll just make sure the distributed splitter has the same flags as the splitter it replaces.

Thanks Todd for helping keep us honest.
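(Side note for anyone tuning this while HBASE-3675 gets sorted out: the flag is read off the Configuration.  A tiny sketch is below; the key name and the default of true are my reading of the 0.90-era splitter, not something this patch introduces.  With it set to false the splitter rethrows anything it cannot classify as corruption, as in the throw e branch quoted above.)

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch only: how the skip-errors flag is typically read off the Configuration.
// The key name and default below reflect my reading of the 0.90-era HLogSplitter.
public class ShowSkipErrorsFlag {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    boolean skipErrors = conf.getBoolean("hbase.hlog.split.skip.errors", true);
    System.out.println("hbase.hlog.split.skip.errors = " + skipErrors);
  }
}
{code}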

> Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3674
>                 URL: https://issues.apache.org/jira/browse/HBASE-3674
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.90.2
>
>         Attachments: 3674-v2.txt, 3674.txt
>
>
> In short, a ChecksumException will fail log processing for a server so we skip out w/o archiving logs.  On restart, we'll then reprocess the logs -- hit the checksumexception anew, usually -- and so on.
> Here is the splitLog method (edited):
> {code}
>   private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
>     ....
>     outputSink.startWriterThreads(entryBuffers);
>     
>     try {
>       int i = 0;
>       for (FileStatus log : logfiles) {
>        Path logPath = log.getPath();
>         long logLength = log.getLen();
>         splitSize += logLength;
>         LOG.debug("Splitting hlog " + (i++ + 1) + " of " + logfiles.length
>             + ": " + logPath + ", length=" + logLength);
>         try {
>           recoverFileLease(fs, logPath, conf);
>           parseHLog(log, entryBuffers, fs, conf);
>           processedLogs.add(logPath);
>         } catch (EOFException eof) {
>           // truncated files are expected if a RS crashes (see HBASE-2643)
>           LOG.info("EOF from hlog " + logPath + ". Continuing");
>           processedLogs.add(logPath);
>         } catch (FileNotFoundException fnfe) {
>           // A file may be missing if the region server was able to archive it
>           // before shutting down. This means the edits were persisted already
>           LOG.info("A log was missing " + logPath +
>               ", probably because it was moved by the" +
>               " now dead region server. Continuing");
>           processedLogs.add(logPath);
>         } catch (IOException e) {
>           // If the IOE resulted from bad file format,
>           // then this problem is idempotent and retrying won't help
>           if (e.getCause() instanceof ParseException ||
>               e.getCause() instanceof ChecksumException) {
>             LOG.warn("ParseException from hlog " + logPath + ".  continuing");
>             processedLogs.add(logPath);
>           } else {
>             if (skipErrors) {
>               LOG.info("Got while parsing hlog " + logPath +
>                 ". Marking as corrupted", e);
>               corruptedLogs.add(logPath);
>             } else {
>               throw e;
>             }
>           }
>         }
>       }
>       if (fs.listStatus(srcDir).length > processedLogs.size()
>           + corruptedLogs.size()) {
>         throw new OrphanHLogAfterSplitException(
>             "Discovered orphan hlog after split. Maybe the "
>             + "HRegionServer was not dead when we started");
>       }
>       archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);      
>     } finally {
>       splits = outputSink.finishWritingAndClose();
>     }
>     return splits;
>   }
> {code}
> Notice how we'll only archive logs only if we successfully split all logs.  We won't archive 31 of 35 files if we happen to get a checksum exception on file 32.
> I think we should treat a ChecksumException the same as a ParseException; a retry will not fix it if HDFS could not get around the ChecksumException (seems like in our case all replicas were corrupt).
> Here is a play-by-play from the logs:
> {code}
> 813572 2011-03-18 20:31:44,687 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 34 of 35: hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481, length=150       65662813573 2011-03-18 20:31:44,687 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
> ....
> 813617 2011-03-18 20:31:46,238 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[0, 512]=000000cd000000502037383661376439656265643938636463343433386132343631323633303239371d6170695f6163636573735f746f6b656e5f7374       6174735f6275636b65740000000d9fa4d5dc0000012ec9c7cbaf00ffffffff000000010000006d0000005d00000008002337626262663764626431616561366234616130656334383436653732333132643a32390764656661756c746170695f616e64726f69645f6c6f67676564       696e5f73686172655f70656e64696e675f696e69740000012ec956b02804000000000000000100000000ffffffff4e128eca0eb078d0652b0abac467fd09000000cd000000502034663166613763666165333930666332653138346233393931303132623366331d6170695f6163       636573735f746f6b656e5f73746174735f6275636b65740000000d9fa4d5dd0000012ec9c7cbaf00ffffffff000000010000006d0000005d00000008002366303734323966643036323862636530336238333938356239316237386633353a32390764656661756c746170695f61       6e64726f69645f6c6f67676564696e5f73686172655f70656e64696e675f696e69740000012ec9569f1804000000000000000100000000000000d30000004e2066663763393964303633343339666531666461633761616632613964643631331b6170695f6163636573735f746f       6b656e5f73746174735f68
> 813618 org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_7781725413191608261:of:/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481 at 15064576
> 813619         at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
> 813620         at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
> 813621         at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
> 813622         at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
> 813623         at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
> 813624         at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1175)
> 813625         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1807)
> 813626         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1859)
> 813627         at java.io.DataInputStream.read(DataInputStream.java:132)
> 813628         at java.io.DataInputStream.readFully(DataInputStream.java:178)
> 813629         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
> 813630         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
> 813631         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
> 813632         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
> 813633         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1883)
> 813634         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
> 813635         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
> 813636         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
> 813637         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
> 813638         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
> 813639         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:197)
> 813640         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:181)
> 813641         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:384)
> 813642         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
> 813643 2011-03-18 20:31:46,239 WARN org.apache.hadoop.hdfs.DFSClient: Found Checksum error for blk_7781725413191608261_14589573 from 10.20.20.182:50010 at 15064576
> 813644 2011-03-18 20:31:46,240 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_7781725413191608261_14589573 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 813645 2011-03-18 20:31:49,243 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=80624 entries from hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
> ....
> {code}
> See the code above.  On the exception we dump the edits read so far from this block and close out all writers, tying off the recovered.edits files written to that point.  We then skip archiving these files, because we only archive once all files have been processed; we won't archive files 30 of 35 if we failed splitting on file 31.
> I think a ChecksumException should be treated the same as a ParseException.
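
As a rough illustration of that check (a sketch only -- the helper name is hypothetical and not taken from either attached patch), the cause test could also walk the whole cause chain, since a ChecksumException may be wrapped more than one level deep by the time it surfaces from the reader:

{code}
import java.io.IOException;
import java.text.ParseException;

import org.apache.hadoop.fs.ChecksumException;

final class CorruptionCheck {
  // Hypothetical helper: returns true when the failure is idempotent
  // corruption (retrying the split would just hit it again), i.e. a wrapped
  // ParseException or ChecksumException anywhere in the cause chain.
  static boolean isNonRetryableCorruption(final IOException e) {
    for (Throwable t = e.getCause(); t != null; t = t.getCause()) {
      if (t instanceof ChecksumException || t instanceof ParseException) {
        return true;
      }
    }
    return false;
  }
}
{code}

A helper like this could stand in for the raw getCause() instanceof checks in the catch clause; when it matches, the log would be added to processedLogs and the split would move on to the next file.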

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "Prakash Khemani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025071#comment-13025071 ] 

Prakash Khemani commented on HBASE-3674:
----------------------------------------

+1

[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025051#comment-13025051 ] 

stack commented on HBASE-3674:
------------------------------

bq. The patch sets the hbase.hlog.split.skip.errors to true by default. I am wondering why the CheckSumException was not ignored as originally proposed?

Prakash, there are two patches.  The first one adds the handling that ignores a ChecksumException.

[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "Nicolas Spiegelberg (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009514#comment-13009514 ] 

Nicolas Spiegelberg commented on HBASE-3674:
--------------------------------------------

+1

[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart

Posted by "Prakash Khemani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022964#comment-13022964 ] 

Prakash Khemani commented on HBASE-3674:
----------------------------------------

The patch sets the hbase.hlog.split.skip.errors to true by default. I am wondering why the CheckSumException was not ignored as originally proposed?

This patch is in trunk. In the serialized log splitting case hbase.hlog.split.skip.errors defaults to true, but in the distributed log splitting case it defaults to false.
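
For reference, here is a minimal sketch of reading the flag (illustrative only; the class, main(), and explicit override are assumptions for the example, not an excerpt from either patch -- the defaults in the comment simply restate the behaviour described above):

{code}
import org.apache.hadoop.conf.Configuration;

public class SkipErrorsFlagExample {
  static final String SKIP_ERRORS_KEY = "hbase.hlog.split.skip.errors";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Per the discussion above, serialized log splitting reads this flag with
    // a default of true while distributed log splitting defaults it to false,
    // so setting it explicitly removes any ambiguity between the two paths.
    conf.setBoolean(SKIP_ERRORS_KEY, true);
    boolean skipErrors = conf.getBoolean(SKIP_ERRORS_KEY, false);
    System.out.println(SKIP_ERRORS_KEY + " = " + skipErrors);
  }
}
{code}

Setting the property explicitly in hbase-site.xml should make both code paths behave the same.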
