Posted to issues@kudu.apache.org by "Adar Dembo (JIRA)" <ji...@apache.org> on 2016/03/07 23:47:40 UTC
[jira] [Resolved] (KUDU-1362) Ensure master behaves correctly after
a sys_catalog write failure
[ https://issues.apache.org/jira/browse/KUDU-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adar Dembo resolved KUDU-1362.
------------------------------
Resolution: Duplicate
Whoops, Alex already had an issue tracking this.
> Ensure master behaves correctly after a sys_catalog write failure
> -----------------------------------------------------------------
>
> Key: KUDU-1362
> URL: https://issues.apache.org/jira/browse/KUDU-1362
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 0.7.0
> Reporter: Adar Dembo
> Assignee: Adar Dembo
> Priority: Critical
>
> For multi-master usage to truly be safe, we must ensure that a failure to write to the system catalog table is handled correctly. When there's only one master this can only happen in the event of a disk failure or equivalent, but with multiple masters, failures can happen all the time (e.g. failed replicas, network partitions).
> So far I've only found one case where this is truly broken, in catalog_manager.cc:L2444:
> {noformat}
> 2433 void CatalogManager::DeleteTabletsAndSendRequests(const scoped_refptr<TableInfo>& table) {
> 2434   vector<scoped_refptr<TabletInfo> > tablets;
> 2435   table->GetAllTablets(&tablets);
> 2436
> 2437   string deletion_msg = "Table deleted at " + LocalTimeAsString();
> 2438
> 2439   for (const scoped_refptr<TabletInfo>& tablet : tablets) {
> 2440     DeleteTabletReplicas(tablet.get(), deletion_msg);
> 2441
> 2442     TabletMetadataLock tablet_lock(tablet.get(), TabletMetadataLock::WRITE);
> 2443     tablet_lock.mutable_data()->set_state(SysTabletsEntryPB::DELETED, deletion_msg);
> >2444     CHECK_OK(sys_catalog_->UpdateTablets({ tablet.get() }));
> 2445     tablet_lock.Commit();
> 2446   }
> 2447 }
> 2447 }
> {noformat}
> In this case we should batch up all of the tablet deletions into one UpdateTablets() call, and pass the status up to the DeleteTable caller too.
> Part of the work here is an integration test that provides good coverage for the various failure paths.
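
The batching idea in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: Status, Tablet, FakeSysCatalog and DeleteTabletsBatched are hypothetical stand-ins written for this example, not Kudu's real classes, and the replica deletion requests are left out. The point is the shape: stage every tablet's DELETED state, issue one UpdateTablets() call for the whole batch, commit only if that write succeeds, and return the status to the caller instead of crashing via CHECK_OK.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; NOT Kudu's real classes.
struct Status {
  bool ok_flag;
  std::string msg;
  static Status OK() { return {true, ""}; }
  static Status IOError(std::string m) { return {false, std::move(m)}; }
  bool ok() const { return ok_flag; }
};

enum class TabletState { RUNNING, DELETED };

// Simplified copy-on-write metadata: 'pending' is staged, 'committed' is visible.
struct Tablet {
  TabletState committed = TabletState::RUNNING;
  TabletState pending = TabletState::RUNNING;
  void Commit() { committed = pending; }
  void Abort() { pending = committed; }
};

// Fake system catalog whose writes can be made to fail on demand.
struct FakeSysCatalog {
  bool fail_writes = false;
  int write_calls = 0;
  std::size_t last_batch_size = 0;

  Status UpdateTablets(const std::vector<Tablet*>& tablets) {
    ++write_calls;
    last_batch_size = tablets.size();
    if (fail_writes) {
      return Status::IOError("simulated sys_catalog write failure");
    }
    return Status::OK();
  }
};

// Stage all deletions, write them in ONE batch, and propagate any failure.
Status DeleteTabletsBatched(FakeSysCatalog* catalog, std::vector<Tablet>* tablets) {
  std::vector<Tablet*> batch;
  for (Tablet& t : *tablets) {
    t.pending = TabletState::DELETED;  // staged only, not yet visible
    batch.push_back(&t);
  }

  Status s = catalog->UpdateTablets(batch);
  if (!s.ok()) {
    // The write did not go through: discard staged changes and report it.
    for (Tablet* t : batch) t->Abort();
    return s;
  }

  // The write succeeded: make the new state visible.
  for (Tablet* t : batch) t->Commit();
  return Status::OK();
}
```

With fail_writes set, the function returns a non-OK status after a single write attempt and every tablet keeps its RUNNING state, which is the sort of failure path the proposed integration test would need to exercise.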