
Posted to issues@kudu.apache.org by "Adar Dembo (JIRA)" <ji...@apache.org> on 2016/12/06 23:50:58 UTC

[jira] [Resolved] (KUDU-1791) read-only log block manager should not truncate metadata files

     [ https://issues.apache.org/jira/browse/KUDU-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo resolved KUDU-1791.
------------------------------
       Resolution: Fixed
         Assignee: Adar Dembo
    Fix Version/s: 1.2.0

Fixed in commit 2453a67310f62c01216e5a0ed08f192a08adc005.

> read-only log block manager should not truncate metadata files
> --------------------------------------------------------------
>
>                 Key: KUDU-1791
>                 URL: https://issues.apache.org/jira/browse/KUDU-1791
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 1.2.0
>            Reporter: Adar Dembo
>            Assignee: Adar Dembo
>             Fix For: 1.2.0
>
>
> This failure appears to be extremely rare (it is not even on the flaky test dashboard); I'm noting it here in case it recurs.
> The error:
> {noformat}
> F1206 15:43:33.546993 21974 open-readonly-fs-itest.cc:121] Check failed: _s.ok() Bad status: Corruption: Could not read records from container /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8: Data length checksum does not match: Incorrect checksum in file /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8.metadata at offset 4085: Checksum does not match. Expected: 0. Actual: 1214729159
> *** Check failure stack trace: ***
>     @     0x7eff9150b21d  google::LogMessage::Fail() at ??:0
>     @     0x7eff9150d28c  google::LogMessage::SendToLog() at ??:0
>     @     0x7eff9150ad79  google::LogMessage::Flush() at ??:0
>     @     0x7eff9150dc1f  google::LogMessageFatal::~LogMessageFatal() at ??:0
>     @           0x40530f  _ZZN4kudu5itest43OpenReadonlyFsITest_TestWriteAndVerify_Test8TestBodyEvENKUlvE_clEv at /home/jenkins-slave/workspace/kudu-0/thirdparty/installed/uninstrumented/include/glog/logging.h:697
>     @     0x7eff912aca40  (unknown) at ??:0
>     @     0x7eff8c8a3184  start_thread at ??:0
>     @     0x7eff90d1a37d  clone at ??:0
>     @              (nil)  (unknown)
> {noformat}
> In this test, a client workload is performed concurrently with a looping thread that opens a read-only FsManager. Opening the FsManager forces the log block manager to reload all of the on-disk metadata every time; this test approximates the (real) use case of a read-only CLI filesystem tool running concurrently with a live Kudu server.
> The error itself shows the thread attempting to validate the length of a particular metadata record in a container. The validation does an 8-byte read: the first 4 bytes are the record length and the second 4 bytes are the length's checksum. The validation fails because the second 4 bytes (the stored checksum) are 0, while the checksum computed over the length is non-zero (1214729159 in the error above).
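The length check described above can be sketched roughly as follows. This is an illustration only, not Kudu's code: the names `crc32c`, `make_length_header`, and `validate_length_header`, the little-endian field layout, and the choice of CRC32C as the checksum are all assumptions that this message does not confirm.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def make_length_header(length: int) -> bytes:
    """Build the assumed 8-byte header: 4-byte length, then 4-byte checksum of it."""
    length_bytes = struct.pack("<I", length)
    return length_bytes + struct.pack("<I", crc32c(length_bytes))


def validate_length_header(header: bytes) -> int:
    """Return the record length, or raise if the header is short or corrupt."""
    if len(header) != 8:
        raise ValueError("partial header: expected 8 bytes, got %d" % len(header))
    length_bytes = header[:4]
    (stored,) = struct.unpack("<I", header[4:])
    computed = crc32c(length_bytes)
    if stored != computed:
        # Mirrors the wording of the error above: "Expected" is the stored
        # value, "Actual" is the value computed from the length bytes.
        raise ValueError(
            "Checksum does not match. Expected: %d. Actual: %d" % (stored, computed)
        )
    (length,) = struct.unpack("<I", length_bytes)
    return length
```

Under these assumptions, an 8-byte header that reads back as all zeros fails the check: the stored checksum is 0, while CRC32C over four zero bytes is 1214729159. That equals the "Actual" value in the error above, which would be consistent with the whole header having been read from a zero-filled region rather than only the checksum field being lost.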
> I scanned the reading/writing code in pb_util.cc but I can't see any obvious places where we're misusing the filesystem in such a way that we'd expect to see intermediate 0s in this field. For example, we always issue a single write() syscall to write a record to disk, including its length, checksum, body, and body checksum.
> I took another look at the test log and I think I've found the smoking gun: 
> {noformat}
> W1206 15:43:32.967667 24555 log_block_manager.cc:502] Log block manager: Found partial trailing metadata record in container /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8: Truncating metadata file to last valid offset: 4081
> {noformat}
> This shows a log block manager that, during startup, found a metadata file with a partial record and decided to truncate it. The problem: this must be the read-only FsManager thread because it's the only entity starting up over and over. Indeed, there's no read-only protection for this case, and there should be.
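The missing protection amounts to a guard in front of the truncation. Here is a minimal sketch, assuming a hypothetical helper `repair_partial_trailing_record` and a `read_only` flag; neither name comes from Kudu, and the actual fix in the commit above may be structured differently.

```python
import os
import tempfile


def repair_partial_trailing_record(path: str, last_valid_offset: int,
                                   read_only: bool) -> bool:
    """Drop bytes after the last valid record; return True if the file changed.

    Hypothetical helper: a read-only opener ignores the trailing bytes and
    leaves the file alone, since a live writer may still be appending to it.
    """
    if os.path.getsize(path) <= last_valid_offset:
        return False  # nothing trails the last valid record
    if read_only:
        return False  # never modify on-disk state in read-only mode
    os.truncate(path, last_valid_offset)
    return True


def demo() -> tuple:
    """Show the file size after a read-only pass, then after a read-write pass."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        # 4081 bytes of "valid records" followed by a 7-byte partial record.
        f.write(b"R" * 4081 + b"partial")
        path = f.name
    try:
        repair_partial_trailing_record(path, 4081, read_only=True)
        size_after_read_only = os.path.getsize(path)
        repair_partial_trailing_record(path, 4081, read_only=False)
        size_after_read_write = os.path.getsize(path)
    finally:
        os.unlink(path)
    return size_after_read_only, size_after_read_write
```

In this sketch the read-only pass leaves the file at 4088 bytes, and only the read-write pass truncates it to 4081. The read-only caller treats the partial record as "not yet written" and leaves cleanup to the process that owns the file.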



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Resolved] (KUDU-1791) read-only log block manager should not truncate metadata files

Posted by Dinesh Bhat <di...@cloudera.com>.

Thanks, Adar, for triaging and fixing this so quickly.

> On Dec 7, 2016, at 5:20 AM, Adar Dembo (JIRA) <ji...@apache.org> wrote:
> 
> 
>     [ https://issues.apache.org/jira/browse/KUDU-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Adar Dembo resolved KUDU-1791.
> ------------------------------
>       Resolution: Fixed
>         Assignee: Adar Dembo
>    Fix Version/s: 1.2.0
> 
> Fixed in commit 2453a67310f62c01216e5a0ed08f192a08adc005.
> 