Posted to issues@kudu.apache.org by "Adar Dembo (JIRA)" <ji...@apache.org> on 2016/03/07 23:47:40 UTC

[jira] [Resolved] (KUDU-1362) Ensure master behaves correctly after a sys_catalog write failure

     [ https://issues.apache.org/jira/browse/KUDU-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adar Dembo resolved KUDU-1362.
------------------------------
    Resolution: Duplicate

Whoops, Alex already had an issue tracking this.

> Ensure master behaves correctly after a sys_catalog write failure
> -----------------------------------------------------------------
>
>                 Key: KUDU-1362
>                 URL: https://issues.apache.org/jira/browse/KUDU-1362
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.7.0
>            Reporter: Adar Dembo
>            Assignee: Adar Dembo
>            Priority: Critical
>
> For multi-master usage to truly be safe, we must ensure that a failure to write to the system catalog table is handled correctly. When there's only one master, this can only happen in the event of a disk failure or equivalent, but with multiple masters, failures can happen all the time (e.g. failed replicas or network partitions).
> So far I've only found one case where this is truly broken, in catalog_manager.cc:L2444:
> {noformat}
>    2433 void CatalogManager::DeleteTabletsAndSendRequests(const scoped_refptr<TableInfo>& table) {
>    2434   vector<scoped_refptr<TabletInfo> > tablets;
>    2435   table->GetAllTablets(&tablets);
>    2436 
>    2437   string deletion_msg = "Table deleted at " + LocalTimeAsString();
>    2438 
>    2439   for (const scoped_refptr<TabletInfo>& tablet : tablets) {
>    2440     DeleteTabletReplicas(tablet.get(), deletion_msg);
>    2441 
>    2442     TabletMetadataLock tablet_lock(tablet.get(), TabletMetadataLock::WRITE);
>    2443     tablet_lock.mutable_data()->set_state(SysTabletsEntryPB::DELETED, deletion_msg);
>   >2444     CHECK_OK(sys_catalog_->UpdateTablets({ tablet.get() }));
>    2445     tablet_lock.Commit();
>    2446   }
>    2447 }
> {noformat}
> In this case we should batch up all of the tablet deletions into one UpdateTablets() call, and pass the status up to the DeleteTable caller too.
> Part of the work here is an integration test that provides good coverage for the various failure paths.
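
For illustration, the batching proposed above could look roughly like the sketch below. It is not Kudu code: Status, Tablet, PendingUpdate, FakeSysCatalog and DeleteTabletsBatched are hypothetical stand-ins invented for this example, and the real SysCatalogTable and TabletMetadataLock interfaces differ. The idea is to stage every tablet's DELETED state, issue a single UpdateTablets() write, commit the in-memory state only if that write succeeds, and return the status to the caller instead of aborting the process as CHECK_OK does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a status type; not Kudu's real Status class.
struct Status {
  bool is_ok;
  std::string message;
  static Status OK() { return Status{true, ""}; }
  static Status IOError(const std::string& m) { return Status{false, m}; }
  bool ok() const { return is_ok; }
};

enum class TabletState { RUNNING, DELETED };

// Hypothetical in-memory tablet record.
struct Tablet {
  std::string id;
  TabletState committed_state;  // state currently visible to readers
};

// A staged (not yet committed) state change for one tablet.
struct PendingUpdate {
  Tablet* tablet;
  TabletState new_state;
};

// Fake system catalog that can be told to fail its writes (assumption:
// the real catalog write either fully succeeds or returns an error).
struct FakeSysCatalog {
  bool fail_writes = false;
  int write_calls = 0;
  std::size_t last_batch_size = 0;

  Status UpdateTablets(const std::vector<PendingUpdate>& batch) {
    ++write_calls;
    last_batch_size = batch.size();
    if (fail_writes) {
      return Status::IOError("simulated sys_catalog write failure");
    }
    return Status::OK();
  }
};

// Stages all deletions, performs ONE catalog write, and commits the
// in-memory state only on success. The status goes back to the caller.
Status DeleteTabletsBatched(FakeSysCatalog* catalog,
                            std::vector<Tablet>* tablets) {
  assert(catalog != nullptr && tablets != nullptr);

  std::vector<PendingUpdate> batch;
  batch.reserve(tablets->size());
  for (Tablet& t : *tablets) {
    batch.push_back(PendingUpdate{&t, TabletState::DELETED});
  }

  Status s = catalog->UpdateTablets(batch);
  if (!s.ok()) {
    return s;  // nothing was committed; the caller decides what to do
  }

  for (const PendingUpdate& u : batch) {
    u.tablet->committed_state = u.new_state;
  }
  return Status::OK();
}
```

With this shape, a failed write leaves every tablet in its previous in-memory state, and a DeleteTable-style caller can turn the returned status into an error response. How this should be ordered relative to DeleteTabletReplicas() is left open here.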



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)