You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/11/07 22:29:59 UTC

[jira] [Resolved] (KUDU-1670) Crash when bootstrapping recovery wal

     [ https://issues.apache.org/jira/browse/KUDU-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved KUDU-1670.
-------------------------------
       Resolution: Duplicate
    Fix Version/s: n/a

> Crash when bootstrapping recovery wal
> -------------------------------------
>
>                 Key: KUDU-1670
>                 URL: https://issues.apache.org/jira/browse/KUDU-1670
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.0.0
>            Reporter: Dan Burkert
>            Assignee: Todd Lipcon
>            Priority: Critical
>             Fix For: n/a
>
>         Attachments: alter_table-randomized-test.txt.gz
>
>
> {{alter_table-randomized-itest}} turned up an interesting crash in the tablet bootstrap / wal recovery codepath.  Relevant log lines (full log attached):
> {code}
> I0930 09:20:45.003783 17173 ts_tablet_manager.cc:632] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Bootstrapping tablet
> I0930 09:20:45.008419 17173 tablet_bootstrap.cc:380] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Bootstrap starting.
> I0930 09:20:45.025218 17173 tablet_bootstrap.cc:542] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Time spent opening tablet: real 0.000s	user 0.001s	sys 0.000s
> I0930 09:20:45.025753 17173 tablet_bootstrap.cc:598] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Will attempt to recover log segment /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246/wal-000000001 to /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> I0930 09:20:45.026008 17173 tablet_bootstrap.cc:606] T fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Moving log directory /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246 to recovery directory /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery in preparation for log replay
> W0930 09:20:45.026758 17173 log_util.cc:475] Log segment file /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 has 12 initial NULL bytes instead of magic and header length: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 and will be treated as a blank segment.
> I0930 09:20:45.027081 17173 log_util.cc:312] Log segment /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 has no footer. This segment was likely being written when the server previously shut down.
> I0930 09:20:45.027308 17173 log_reader.cc:151] Log segment /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 was likely left in-progress after a previous crash. Will try to rebuild footer by scanning data.
> I0930 09:20:45.027604 17173 log_util.cc:575] Scanning /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 for valid entry headers following offset 12...
> I0930 09:20:45.842016 17173 log_util.cc:612] Found no log entry headers
> I0930 09:20:45.873320 17173 log_util.cc:215] Ignoring log segment corruption in /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 because there are no log entries following the corrupted one. The server probably crashed in the middle of writing an entry to the write-ahead log or downloaded an active log via tablet copy. Error detail: Corruption: CRC mismatch in log entry header: Log file corruption detected. Failed trying to read batch #0 at offset 0 for log segment /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001: Prior entries:
> I0930 09:20:45.873550 17173 log_util.cc:365] Successfully rebuilt footer for segment: /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001 (valid entries through byte offset 0)
> F0930 09:20:45.873761 17173 log_util.h:208] Check failed: header_.IsInitialized() 
> *** Check failure stack trace: ***
>     @     0x7fa678822e0d  google::LogMessage::SendToLog() at ??:0
>     @     0x7fa6788237e9  google::LogMessage::Flush() at ??:0
>     @     0x7fa67882722b  google::LogMessageFatal::~LogMessageFatal() at ??:0
>     @     0x7fa67f074331  kudu::log::ReadableLogSegment::header() at ??:0
>     @     0x7fa67d623e37  kudu::log::LogReader::Init() at ??:0
>     @     0x7fa67d624488  kudu::log::LogReader::OpenFromRecoveryDir() at ??:0
>     @     0x7fa67dc38dcc  kudu::tablet::TabletBootstrap::OpenLogReaderInRecoveryDir() at ??:0
>     @     0x7fa67dc371fa  kudu::tablet::TabletBootstrap::Bootstrap() at ??:0
>     @     0x7fa67dc36c6d  kudu::tablet::BootstrapTablet() at ??:0
>     @     0x7fa67f0a1d98  kudu::tserver::TSTabletManager::OpenTablet() at ??:0
>     @     0x7fa67f0abeb8  boost::_mfi::mf2<>::operator()() at ??:0
>     @     0x7fa67f0abe04  boost::_bi::list3<>::operator()<>() at ??:0
>     @     0x7fa67f0abd74  boost::_bi::bind_t<>::operator()() at ??:0
>     @     0x7fa67f0abb12  boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
>     @     0x7fa67ca1f432  boost::function0<>::operator()() at ??:0
>     @     0x7fa6790744fe  kudu::FunctionRunnable::Run() at ??:0
>     @     0x7fa6790732de  kudu::ThreadPool::DispatchThread() at ??:0
>     @     0x7fa679075b9f  boost::_mfi::mf1<>::operator()() at ??:0
>     @     0x7fa679075afc  boost::_bi::list2<>::operator()<>() at ??:0
>     @     0x7fa679075a74  boost::_bi::bind_t<>::operator()() at ??:0
>     @     0x7fa679075872  boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
>     @     0x7fa67ca1f432  boost::function0<>::operator()() at ??:0
>     @     0x7fa67906cc55  kudu::Thread::SuperviseThread() at ??:0
>     @           0x42506c  __tsan_thread_start_func at ??:0
>     @       0x34186079d1  (unknown) at ??:0
>     @       0x34182e88fd  (unknown) at ??:0
>     @              (nil)  (unknown)
> {code}
> This is my first time spelunking through this code, but it seems we have some broken invariants in this case.  We have a {{ReadableLogSegment}} with an uninitialized header, but with its {{is_initialized_}} flag set to true.  The following comment from log_util.cc:388 leads me to believe this is not expected:
> {code}
>     // If a log file has been pre-allocated but not initialized, then
>     // 'header_size' will be 0 even the file size is > 0; in this
>     // case, 'is_initialized_' remains set to false and return
>     // Status::OK() early. LogReader ignores segments where
>     // IsInitialized() returns false.
> {code}
> You can see from the warning log lines above that we are indeed in the case where the magic bytes are all 0.  By reading through the code it's clear to me _how_ this happened ({{is_initalized_}} gets set at log_util.cc:321), but I'm not quite familiar with the code enough to figure out how to fix it, or what the relevant tests should be.  The stack at the point where this invariant is broken is (growing down):
> {code}
> ReadableLogSegment::Init() // No arg version, eventually sets is_initialized_=true since none of the initialization steps fail
> ReadableLogSegment::ReadHeader // Returns Status::OK back to Init, because header_size==0
> ReadableLogSegment::ReadHeaderMagicAndHeaderLength // Returns Status::OK
> ReadableLogSegment::ParseHeaderMagicAndHeaderLength // Returns Status::OK, because magic bytes are all 0, sets parsed_len=0 (parsed_len is the same as header_size in the ReadHeader frame)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)