Posted to issues@kudu.apache.org by "Alexey Serbin (Jira)" <ji...@apache.org> on 2023/03/13 06:55:00 UTC
[jira] [Updated] (KUDU-3458) Continue loading other tablets even if metadata for some tablets failed to load

     [ https://issues.apache.org/jira/browse/KUDU-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3458:
--------------------------------
    Description: 
kudu-tserver stops tablet bootstrapping if the metadata for even a single tablet fails to load (the kudu-tserver process exits on such an event, subject to the caveat described in KUDU-3419).

The current behavior requires manual intervention.  In most cases, the failure to load tablet metadata is caused by a corrupted metadata file.  The likely cause of such corruption is a power failure, kernel panic, or a similar event where an open file isn't synced.

In a cluster with many tablet servers and RF=3, if a majority of the tablet's replicas is present, a corrupted metadata file could be handled automatically if the tablet server continued bootstrapping the other tablet replicas and eventually registered with the Kudu masters.  The system catalog would detect that the tablet is under-replicated because one replica isn't running, and would re-replicate it elsewhere, sending DELETE_TABLET for the tablet replica that has the corrupted metadata file.  That would be similar to what happens if the consensus metadata for a tablet replica is corrupted.

It's necessary to update the code in {{TSTabletManager}} to allow {{TSTabletManager::Init()}} to complete successfully in such a case, marking the corresponding tablet replicas as failed to load (similar to what's done in the case of a replica's consensus metadata).

  was:
kudu-tserver stops tablet bootstrapping if the metadata for even a single tablet fails to load (the kudu-tserver process exits on such an event, subject to the caveat described in KUDU-3419).

The current behavior requires manual intervention.  In most cases, the failure to load tablet metadata is caused by a corrupted metadata file.  The likely cause of such corruption is a power failure, kernel panic, or a similar event where an open file isn't synced (?).

In a cluster with many tablet servers and RF=3, if a majority of the tablet's replicas is present, a corrupted metadata file could be handled automatically if the tablet server continued bootstrapping the other tablet replicas and eventually registered with the Kudu masters.  The system catalog would detect that the tablet is under-replicated because one replica isn't running, and would re-replicate it elsewhere, sending DELETE_TABLET for the tablet replica that has the corrupted metadata file.  That would be similar to what happens if the consensus metadata for a tablet replica is corrupted.

It's necessary to update the code in {{TSTabletManager}} to allow {{TSTabletManager::Init()}} to complete successfully in such a case, marking the corresponding tablet replicas as failed to load (similar to what's done in the case of a replica's consensus metadata).


> Continue loading other tablets even if metadata for some tablets failed to load
> -------------------------------------------------------------------------------
>
>                 Key: KUDU-3458
>                 URL: https://issues.apache.org/jira/browse/KUDU-3458
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>
> kudu-tserver stops tablet bootstrapping if the metadata for even a single tablet fails to load (the kudu-tserver process exits on such an event, subject to the caveat described in KUDU-3419).
> The current behavior requires manual intervention.  In most cases, the failure to load tablet metadata is caused by a corrupted metadata file.  The likely cause of such corruption is a power failure, kernel panic, or a similar event where an open file isn't synced.
> In a cluster with many tablet servers and RF=3, if a majority of the tablet's replicas is present, a corrupted metadata file could be handled automatically if the tablet server continued bootstrapping the other tablet replicas and eventually registered with the Kudu masters.  The system catalog would detect that the tablet is under-replicated because one replica isn't running, and would re-replicate it elsewhere, sending DELETE_TABLET for the tablet replica that has the corrupted metadata file.  That would be similar to what happens if the consensus metadata for a tablet replica is corrupted.
> It's necessary to update the code in {{TSTabletManager}} to allow {{TSTabletManager::Init()}} to complete successfully in such a case, marking the corresponding tablet replicas as failed to load (similar to what's done in the case of a replica's consensus metadata).
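
The proposed change to the initialization loop could look roughly like the sketch below.  It is a minimal illustration only, not the actual Kudu code: {{ReplicaState}}, {{LoadResult}}, {{MetadataLoader}}, and {{InitTablets}} are hypothetical stand-ins invented for this example, and the real {{TSTabletManager::Init()}} has a different signature and uses Kudu's own status and logging facilities.  The point it shows is that a metadata load failure for one tablet is recorded against that replica and the loop moves on, instead of aborting initialization for the whole server.

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT real Kudu types.
enum class ReplicaState { kReady, kFailedToLoad };

struct LoadResult {
  bool ok;
  std::string error;  // empty when ok is true
};

// Hypothetical callback that reads the metadata of a single tablet.
using MetadataLoader = std::function<LoadResult(const std::string& tablet_id)>;

// Tries to load metadata for every tablet. A failure for one tablet is
// logged and recorded as kFailedToLoad, and the loop continues, so the
// remaining tablets are still loaded and initialization as a whole succeeds.
std::map<std::string, ReplicaState> InitTablets(
    const std::vector<std::string>& tablet_ids,
    const MetadataLoader& load_metadata) {
  std::map<std::string, ReplicaState> states;
  for (const auto& id : tablet_ids) {
    const LoadResult result = load_metadata(id);
    if (!result.ok) {
      std::cerr << "failed to load metadata for tablet " << id << ": "
                << result.error << "\n";
      states[id] = ReplicaState::kFailedToLoad;
      continue;  // keep going instead of returning an error
    }
    states[id] = ReplicaState::kReady;
  }
  return states;
}
```

With three tablets where only the second has a corrupted metadata file, {{InitTablets}} in this sketch returns a map with two replicas in {{kReady}} and one in {{kFailedToLoad}}.  In the scenario described above, the server would then register with the masters, and the failed replica would be reported as not running so that the tablet can be re-replicated elsewhere.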


