
Posted to issues@kudu.apache.org by "zhangsong (JIRA)" <ji...@apache.org> on 2017/01/16 07:14:26 UTC

[jira] [Commented] (KUDU-1579) into "safe mode" when large number of node crash

    [ https://issues.apache.org/jira/browse/KUDU-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823558#comment-15823558 ] 

zhangsong commented on KUDU-1579:
---------------------------------

After experiencing several node failure cases (using kudu-tserver revision b906affcdee3ec814c9e96d35fea715fdbb4c330-dirty), I found these two facts:
1. When multiple kudu-tserver nodes (say, 5 nodes) crash at around the same time (not exactly simultaneously), there will be failed tablets. The reasons for the failed tablets should be those described in issue KUDU-1449. Also, in the kudu-master UI I can see a lot of addServer/removeServer tasks hanging, with no sign that they will recover automatically.
2. When facing multiple node crashes, if I stop kudu-master until the whole cluster is stable (no more node crashes) and then restart kudu-master, no failed tablets are found after all the crashed kudu-tserver nodes have been recovered.

So for my case, it seems kudu-master should freeze for some time when multiple nodes crash at around the same time (e.g. within some period of time). "Freeze" here means it stops servicing addServer/RemoveServer RPCs.
Just some thoughts for today; I may complete this later.
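
The freeze idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Kudu code: the class SafeModeGate, its methods, and the three thresholds are hypothetical names invented for this example. Failures are counted in a sliding time window; once the count reaches a threshold, config changes (addServer/removeServer-style operations) are refused until a quiet period passes with no further crashes.

```python
from collections import deque


class SafeModeGate:
    """Hypothetical sketch of a master-side "safe mode" gate (not part of Kudu)."""

    def __init__(self, max_failures, window_secs, freeze_secs):
        self.max_failures = max_failures  # failures in the window that trigger a freeze
        self.window_secs = window_secs    # length of the sliding window
        self.freeze_secs = freeze_secs    # quiet period required before unfreezing
        self._failures = deque()          # timestamps of recent tserver failures
        self._frozen_until = None         # None means never frozen so far

    def record_failure(self, now):
        """Record one tablet server failure observed at time `now` (seconds)."""
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_secs:
            self._failures.popleft()
        already_frozen = self._frozen_until is not None and now < self._frozen_until
        # Start a freeze on a burst of failures, or extend a running freeze,
        # so the gate stays closed until the cluster has stopped losing nodes.
        if already_frozen or len(self._failures) >= self.max_failures:
            self._frozen_until = now + self.freeze_secs

    def allow_config_change(self, now):
        """Return True if an addServer/removeServer-style change may proceed."""
        return self._frozen_until is None or now >= self._frozen_until
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic; a real implementation would also need to decide what happens to config changes that are already in flight when the freeze begins.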

> into "safe mode"   when large number of node crash
> --------------------------------------------------
>
>                 Key: KUDU-1579
>                 URL: https://issues.apache.org/jira/browse/KUDU-1579
>             Project: Kudu
>          Issue Type: New Feature
>            Reporter: zhangsong
>
> Currently, replication happens whenever a node crashes.
> However, when a large number of nodes crash, this leads to a replication storm,
> which causes a mess and data loss.
> Replication should be prudent, and the cluster should go into a "safe mode" in the node crash case described above.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)