Posted to issues@kudu.apache.org by "Abhishek (Jira)" <ji...@apache.org> on 2021/05/25 15:37:00 UTC

[jira] [Commented] (KUDU-1959) Hard to tell when a cluster is done starting up

    [ https://issues.apache.org/jira/browse/KUDU-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351135#comment-17351135 ] 

Abhishek commented on KUDU-1959:
--------------------------------

As a first step towards this issue, we could add a tablet server startup page that shows startup progress. I think we could break tablet server startup into a few phases (something like initializing, reading the metadata directory, reading the data directory, bootstrapping, and connecting to the masters). The most time-consuming phases are usually reading the log block containers (reading the data directory) and bootstrapping the tablets.
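
To make the phase breakdown concrete, here is a minimal sketch of how the phases could be tracked. It is only an illustration: the names StartupPhase and StartupProgress are hypothetical and do not exist in Kudu, and the real implementation would live in the tablet server's C++ code.

```python
import time
from enum import Enum


class StartupPhase(Enum):
    # Hypothetical phase names, taken from the breakdown suggested above.
    INITIALIZING = "initializing"
    READING_METADATA_DIR = "reading the metadata directory"
    READING_DATA_DIR = "reading the data directory"
    BOOTSTRAPPING = "bootstrapping"
    CONNECTING_TO_MASTERS = "connecting to the masters"


class StartupProgress:
    """Records when each startup phase begins and ends."""

    def __init__(self):
        self._started = {}
        self._finished = {}

    def begin(self, phase, now=None):
        # 'now' can be injected for testing; defaults to a monotonic clock.
        self._started[phase] = time.monotonic() if now is None else now

    def finish(self, phase, now=None):
        self._finished[phase] = time.monotonic() if now is None else now

    def current(self):
        """Returns the first phase that has begun but not finished, or None."""
        for phase in StartupPhase:
            if phase in self._started and phase not in self._finished:
                return phase
        return None

    def elapsed(self, phase):
        """Returns the seconds spent in a finished phase, or None otherwise."""
        if phase in self._started and phase in self._finished:
            return self._finished[phase] - self._started[phase]
        return None
```

A startup page could then render current() together with the elapsed time of each finished phase.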

For these two phases, the page could show the total number of LBM (log block manager) containers or tablets alongside the number processed so far, to keep track of startup progress.
That raises the question of how to get the total number of LBM containers. We do not have a metric for that yet (and even if we had one, it would be reset when the server restarts), so we could simply count the data files in the provided data directories.
The total number of tablets is available after scanning the metadata directory.
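
A rough sketch of the counting idea, assuming each container can be identified by one file with a ".data" suffix somewhere under a data directory (that suffix, the function name count_container_files, and the PhaseCounter class are assumptions made for illustration, not existing Kudu code):

```python
import os


def count_container_files(data_dirs, suffix=".data"):
    """Counts files ending in 'suffix' under every given data directory.

    Assumption: one such file corresponds to one LBM container.
    """
    total = 0
    for data_dir in data_dirs:
        for _root, _dirs, files in os.walk(data_dir):
            total += sum(1 for name in files if name.endswith(suffix))
    return total


class PhaseCounter:
    """Tracks 'processed so far' against a total known up front."""

    def __init__(self, total):
        self.total = total
        self.processed = 0

    def mark_processed(self, n=1):
        # Never report more than the total, in case the estimate was low.
        self.processed = min(self.total, self.processed + n)

    def percent(self):
        # An empty phase is reported as complete.
        if self.total == 0:
            return 100.0
        return 100.0 * self.processed / self.total
```

The same PhaseCounter shape would work for the bootstrapping phase, with the total taken from the number of tablets found in the metadata directory.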

Currently, the tablet server WebUI starts while the tablets are already in the bootstrapping phase. We could start the WebUI before this phase, but serve only the tablet server startup progress page at first, and load the other pages once we reach the bootstrapping phase.
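
One way to picture the early-WebUI idea is a small gate in front of the page handlers. This is a sketch only: the WebUiGate class, the "/startup" path, and the choice of HTTP 503 for pages that are not ready yet are all assumptions, not how the Kudu web server works today.

```python
class WebUiGate:
    """Decides which pages may be served at a given point in startup."""

    def __init__(self, startup_path="/startup"):
        # Hypothetical path for the startup progress page.
        self._startup_path = startup_path
        self._all_pages_enabled = False

    def enable_all_pages(self):
        # To be called once startup reaches the bootstrapping phase.
        self._all_pages_enabled = True

    def status_for(self, path):
        """Returns the HTTP status code a request for 'path' should get."""
        if path == self._startup_path or self._all_pages_enabled:
            return 200
        # Assumed choice: 503 Service Unavailable until the page is registered.
        return 503
```

With this shape, the startup page is reachable as soon as the web server is up, and a request for any other page is rejected until enable_all_pages() is called.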

> Hard to tell when a cluster is done starting up
> -----------------------------------------------
>
>                 Key: KUDU-1959
>                 URL: https://issues.apache.org/jira/browse/KUDU-1959
>             Project: Kudu
>          Issue Type: Improvement
>          Components: ops-tooling
>            Reporter: Jean-Daniel Cryans
>            Assignee: Abhishek
>            Priority: Major
>              Labels: roadmap-candidate, usability
>
> Restarting a cluster that has a good amount of data, it's hard to tell when it's "done". Right now the things I do:
>  - Run ksck, wait until most tablets are not in "unavailable" or "bootstrapping" state.
>  - Watch the metrics and see when the data under management is close to where it was before restarting (it grows as tablets are getting bootstrapped).
>  - Look at the tablet server web UIs for tablets, compare how many are done bootstrapping VS in the process of VS not started.
> Ideas on how to improve this:
>  - In the master's web UI for tablet servers, show how many tablets are running VS not running (I wouldn't add anything about tombstoned tablets)
>  - Add metrics for tablets in different states.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)