Posted to issues@kudu.apache.org by "Mike Percy (JIRA)" <ji...@apache.org> on 2019/04/23 17:10:00 UTC

[jira] [Created] (KUDU-2792) Automatically retry failed bootstrap on tablets that failed to start due to disk space

Mike Percy created KUDU-2792:
--------------------------------

             Summary: Automatically retry failed bootstrap on tablets that failed to start due to disk space
                 Key: KUDU-2792
                 URL: https://issues.apache.org/jira/browse/KUDU-2792
             Project: Kudu
          Issue Type: Task
          Components: tserver
    Affects Versions: 1.8.0
            Reporter: Mike Percy


If a tablet replica fails to bootstrap because there is insufficient disk space to replay the WAL, it stays in the FAILED state, which looks like this in ksck, even after the user frees up disk space:

 
{code:java}
5edf82f0516b4897b3a7991a7e67d71c (host1.example.com:7050): not running [LEADER]
 State: FAILED
 Data state: TABLET_DATA_READY
 Last status: IO error: Failed log replay. Reason: Failed to open new log: Insufficient disk space to allocate 8388608 bytes under path /data/1/kudu/tablet/wal/wals/5807c5100e0d4522a66e32efbb29d57e/.kudutmp.newsegmentzGFKEg (7939936256 bytes available vs 19993874923 bytes reserved) (error 28)
{code}
Today, recovering from this requires restarting the tablet server.
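
As a rough illustration of why the status above reports insufficient space, the numbers in the "Last status" line can be plugged into a simple reserve check. This is only a sketch: the rule that an allocation must leave at least the reserved amount free is an assumption inferred from the wording of the message, and the function names are hypothetical, not Kudu APIs.

```python
def has_sufficient_space(available_bytes, requested_bytes, reserved_bytes):
    # Assumed rule: the allocation must leave at least the reserved amount free.
    return available_bytes - requested_bytes >= reserved_bytes


def bytes_to_free(available_bytes, requested_bytes, reserved_bytes):
    # How much space would have to be freed before the assumed check passes.
    return max(0, reserved_bytes + requested_bytes - available_bytes)


# Values copied from the "Last status" line above.
AVAILABLE = 7939936256
REQUESTED = 8388608
RESERVED = 19993874923

print(has_sufficient_space(AVAILABLE, REQUESTED, RESERVED))
print(bytes_to_free(AVAILABLE, REQUESTED, RESERVED))
```

Under that assumption the check fails for the values shown, because the available space is already below the reserved amount before the 8 MiB segment is even allocated.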

It should be possible for a tablet server (i.e. the TSTabletManager) to detect that the failure was temporary rather than permanent, and to retry the failed bootstrap later, once additional disk space has been freed. From a programming perspective, that may require dealing with some object lifecycle issues (e.g. not reusing the Tablet object from the failed bootstrap).
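
The retry idea described above could look roughly like the following. This is a language-neutral sketch written in Python, not Kudu code: BootstrapError, is_transient, retry_bootstrap and the make_tablet, bootstrap and wait callables are all hypothetical names, and treating only ENOSPC (error 28 in the status above) as retryable is an assumption.

```python
import errno


class BootstrapError(Exception):
    """Hypothetical error type carrying the OS errno of a failed bootstrap."""

    def __init__(self, message, os_errno):
        super().__init__(message)
        self.os_errno = os_errno


def is_transient(err):
    # Assumption: only "no space left on device" (ENOSPC) is worth retrying.
    return err.os_errno == errno.ENOSPC


def retry_bootstrap(make_tablet, bootstrap, max_attempts, wait):
    """Retry a bootstrap, building a fresh tablet object for every attempt.

    make_tablet, bootstrap and wait are hypothetical callables supplied by
    the caller. Returns the tablet object that bootstrapped successfully.
    """
    for attempt in range(1, max_attempts + 1):
        tablet = make_tablet()  # never reuse the object from a failed attempt
        try:
            bootstrap(tablet)
            return tablet
        except BootstrapError as err:
            # Permanent failures and the final attempt propagate to the caller.
            if not is_transient(err) or attempt == max_attempts:
                raise
            wait(attempt)  # e.g. sleep, or block until disk space is rechecked
```

Creating a new object on each attempt sidesteps the lifecycle concern mentioned above, at the cost of repeating any setup work; a real implementation would also need to decide how long to keep retrying.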



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)