Posted to issues@kudu.apache.org by "Adar Dembo (JIRA)" <ji...@apache.org> on 2016/12/06 23:50:58 UTC
[jira] [Resolved] (KUDU-1791) read-only log block manager should not truncate metadata files
[ https://issues.apache.org/jira/browse/KUDU-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adar Dembo resolved KUDU-1791.
------------------------------
Resolution: Fixed
Assignee: Adar Dembo
Fix Version/s: 1.2.0
Fixed in commit 2453a67310f62c01216e5a0ed08f192a08adc005.
> read-only log block manager should not truncate metadata files
> --------------------------------------------------------------
>
> Key: KUDU-1791
> URL: https://issues.apache.org/jira/browse/KUDU-1791
> Project: Kudu
> Issue Type: Bug
> Components: fs
> Affects Versions: 1.2.0
> Reporter: Adar Dembo
> Assignee: Adar Dembo
> Fix For: 1.2.0
>
>
> This appears to happen extremely rarely (i.e. not even on the flaky test dashboard); I'm noting it here in case it shows up again.
> The error:
> {noformat}
> F1206 15:43:33.546993 21974 open-readonly-fs-itest.cc:121] Check failed: _s.ok() Bad status: Corruption: Could not read records from container /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8: Data length checksum does not match: Incorrect checksum in file /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8.metadata at offset 4085: Checksum does not match. Expected: 0. Actual: 1214729159
> *** Check failure stack trace: ***
> @ 0x7eff9150b21d google::LogMessage::Fail() at ??:0
> @ 0x7eff9150d28c google::LogMessage::SendToLog() at ??:0
> @ 0x7eff9150ad79 google::LogMessage::Flush() at ??:0
> @ 0x7eff9150dc1f google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x40530f _ZZN4kudu5itest43OpenReadonlyFsITest_TestWriteAndVerify_Test8TestBodyEvENKUlvE_clEv at /home/jenkins-slave/workspace/kudu-0/thirdparty/installed/uninstrumented/include/glog/logging.h:697
> @ 0x7eff912aca40 (unknown) at ??:0
> @ 0x7eff8c8a3184 start_thread at ??:0
> @ 0x7eff90d1a37d clone at ??:0
> @ (nil) (unknown)
> {noformat}
> In this test, a client workload is performed concurrently with a looping thread that opens a read-only FsManager. Opening the FsManager forces the log block manager to reload all of the on-disk metadata every time; this test approximates the (real) use case of a read-only CLI filesystem tool running concurrently with a live Kudu server.
> The error itself shows the thread attempting to validate the length of a particular metadata record in a container. The validation does an 8-byte read: the first 4 bytes are the record length and the second 4 bytes are the length's checksum. The validation fails because the second 4 bytes are 0 while the checksum computed from the length is non-zero.
> I scanned the reading/writing code in pb_util.cc but I can't see any obvious places where we're misusing the filesystem in such a way that we'd expect to see intermediate 0s in this field. For example, we always issue a single write() syscall to write a record to disk, including its length, checksum, body, and body checksum.
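A side note on the numbers in the error above, as a minimal sketch rather than Kudu's actual code. It assumes the length checksum is a plain (unmasked) CRC32C over the 4 length bytes and that both fields are little-endian; both are assumptions, and the names `Crc32c`, `DecodeFixed32LE`, `HeaderCheck`, and `CheckLengthHeader` are hypothetical. Under those assumptions, the reported "Actual" value 1214729159 (0x48674BC7) is exactly the CRC32C of four zero bytes, which would mean the length field read as 0 too, i.e. the whole 8-byte header starting at offset 4081 (4085 minus 4) was zeros.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
// Slow but dependency-free; production code would normally use a
// lookup table or a hardware instruction instead.
uint32_t Crc32c(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : (crc >> 1);
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Decodes a 32-bit little-endian value (the byte order is an assumption).
uint32_t DecodeFixed32LE(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) |
         (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Result of validating the 8-byte record header described above:
// 4 bytes of record length followed by 4 bytes of length checksum.
struct HeaderCheck {
  uint32_t length;             // decoded record length
  uint32_t stored_checksum;    // checksum read from the file
  uint32_t computed_checksum;  // checksum recomputed over the length bytes
  bool ok;                     // true if stored and computed checksums match
};

// Hypothetical helper: validates the length field against its checksum.
HeaderCheck CheckLengthHeader(const uint8_t header[8]) {
  HeaderCheck r;
  r.length = DecodeFixed32LE(header);
  r.stored_checksum = DecodeFixed32LE(header + 4);
  r.computed_checksum = Crc32c(header, 4);
  r.ok = (r.stored_checksum == r.computed_checksum);
  return r;
}
```

Calling `CheckLengthHeader` on eight zero bytes gives a stored checksum of 0 and a computed checksum of 1214729159, the same pair of values that appears in the failure message.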
> I took another look at the test log and I think I've found the smoking gun:
> {noformat}
> W1206 15:43:32.967667 24555 log_block_manager.cc:502] Log block manager: Found partial trailing metadata record in container /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8: Truncating metadata file to last valid offset: 4081
> {noformat}
> This shows a log block manager that, during startup, found a metadata file with a partial record and decided to truncate it. The problem: this must be the read-only FsManager thread because it's the only entity starting up over and over. Indeed, there's no read-only protection for this case, and there should be.
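The missing protection described above could look roughly like the following. This is a hedged sketch, not the change made in the commit referenced in this message (its contents are not shown here); `OpenMode`, `RepairDecision`, and `HandlePartialTrailingRecord` are hypothetical names. The idea is that a read-only opener may ignore a partial trailing record in memory but must never modify the file, since a live server may still be writing to it.

```cpp
#include <cstdint>
#include <string>

// How the block manager was opened (hypothetical enum for illustration).
enum class OpenMode { kReadWrite, kReadOnly };

// What to do after finding a partial trailing metadata record.
struct RepairDecision {
  bool truncate_file;        // true only if the on-disk file may be modified
  uint64_t usable_length;    // bytes of the file treated as valid records
  std::string log_message;   // what the caller would log
};

// Decides how to handle a partial trailing record whose last fully valid
// record ends at 'last_valid_offset'.
RepairDecision HandlePartialTrailingRecord(OpenMode mode,
                                           uint64_t last_valid_offset) {
  RepairDecision d;
  d.usable_length = last_valid_offset;
  if (mode == OpenMode::kReadOnly) {
    // Read-only: skip the trailing bytes in memory, leave the file untouched.
    d.truncate_file = false;
    d.log_message = "Ignoring partial trailing metadata record after offset " +
                    std::to_string(last_valid_offset);
  } else {
    // Read-write: this process owns the file, so repairing it is allowed.
    d.truncate_file = true;
    d.log_message = "Truncating metadata file to last valid offset: " +
                    std::to_string(last_valid_offset);
  }
  return d;
}
```

With the offset from the warning above, `HandlePartialTrailingRecord(OpenMode::kReadOnly, 4081)` reports that the file must not be truncated while still treating only the first 4081 bytes as valid.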
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Re: [jira] [Resolved] (KUDU-1791) read-only log block manager should not truncate metadata files
Posted by Dinesh Bhat <di...@cloudera.com>.
Thanks, Adar, for triaging and fixing this so quickly.
> On Dec 7, 2016, at 5:20 AM, Adar Dembo (JIRA) <ji...@apache.org> wrote:
>
>
> [ https://issues.apache.org/jira/browse/KUDU-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Adar Dembo resolved KUDU-1791.
> ------------------------------
> Resolution: Fixed
> Assignee: Adar Dembo
> Fix Version/s: 1.2.0
>
> Fixed in commit 2453a67310f62c01216e5a0ed08f192a08adc005.