Posted to issues@kudu.apache.org by "Will Berkeley (JIRA)" <ji...@apache.org> on 2018/06/14 15:19:00 UTC

[jira] [Commented] (KUDU-2476) Kudu restart creates many tombstone tablets

    [ https://issues.apache.org/jira/browse/KUDU-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512603#comment-16512603 ] 

Will Berkeley commented on KUDU-2476:
-------------------------------------

[~farkastfbic]
 * Is it OK that, when the time sync changes, all the nodes suddenly crash?

Yes, that's more or less expected. Kudu requires synchronized clocks to run. It has some ability to ride out temporary periods of desynchronization, but it will crash if the clock error gets too high. You may have hit KUDU-2209, which is fixed in 1.5.1. Seeing the FATAL error would help.
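For reference, the crash happens when the server's estimated clock error exceeds a configured limit (the `--max_clock_sync_error_usec` flag, 10 seconds by default). A minimal sketch of that kind of check, assuming a simple linear drift model (the function names and the drift model here are illustrative, not Kudu's actual code):

```python
# Hypothetical sketch of a hybrid-clock error-bound check.
# Not Kudu's real implementation; names and model are illustrative.

MAX_CLOCK_SYNC_ERROR_USEC = 10_000_000  # Kudu's default --max_clock_sync_error_usec

def clock_error_usec(last_reported_error_usec, drift_ppm, secs_since_sync):
    """Worst-case clock error: the error last reported by the time source,
    plus accumulated drift since the clock was last synchronized.
    1 ppm of drift adds 1 microsecond of error per second."""
    return last_reported_error_usec + drift_ppm * secs_since_sync

def check_clock(last_reported_error_usec, drift_ppm, secs_since_sync):
    err = clock_error_usec(last_reported_error_usec, drift_ppm, secs_since_sync)
    if err > MAX_CLOCK_SYNC_ERROR_USEC:
        # In Kudu this would surface as a FATAL and the server would exit.
        raise RuntimeError(f"clock error {err}us exceeds the configured limit")
    return err
```

This is why restarting chronyd on every node at once can take every Kudu server down at the same time: each one independently sees its error bound blow past the limit.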
 * Is it OK that, after a Kudu service restart, the tablets go crazy and start sending a lot of data across the network, "syncing" up, which takes ~5-10 minutes (~800 tablets per tablet server)?

It's not OK, but a lot of re-replication at startup is expected in many cases. When tablet servers start at different times and at different rates, and take a while to finish the startup process, that can trigger a lot of re-replication. Improvements in 1.7, namely KUDU-1097 (a.k.a. 3-4-3 replication), make this situation much better. Tombstones are a necessary side effect of re-replication: they consume almost no resources and they are required for correctness in certain corner cases, so don't try to delete them.
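If restarts routinely trigger this before you can upgrade to 1.7, one knob that helps is giving restarting tablet servers more time before their replicas are considered failed and re-replicated. A sketch of a tablet server gflags file, assuming a stock packaging layout (the file path is an assumption; the flag name is real, with a 300-second default):

```
# /etc/kudu/conf/tserver.gflagfile  (path is an assumption; adjust to your install)
# Wait longer before evicting replicas on an unreachable follower, so that
# a slow rolling restart does not kick off mass re-replication.
--follower_unavailable_considered_failed_sec=600
```

Raising this trades slower recovery from a genuinely dead server for less churn during planned restarts, so pick a value that matches how long your restarts actually take.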

Also, it's better to use the user mailing list or Slack for questions like this, and to reserve JIRA for filing bugs or feature requests. I'm going to close this JIRA, but please feel free to follow up in one of those forums.

> Kudu restart creates many tombstone tablets
> -------------------------------------------
>
>                 Key: KUDU-2476
>                 URL: https://issues.apache.org/jira/browse/KUDU-2476
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Tomas Farkas
>            Priority: Major
>
> After changing the chrony configuration and restarting the chronyd daemon on all nodes, all the Kudu servers exited unexpectedly. I therefore restarted all the Kudu nodes (tablet servers and masters), and when they came back up, many tablets were in INITIALIZED state and many ended up in a tombstoned state.
> Live Tablets Summary (from the tablet server web UI):
> Status          Count   Percentage
> BOOTSTRAPPING       4     0.50%
> INITIALIZED       117    14.68%
> RUNNING           676    84.82%
> Total             797   100.00%
> The tables' consistency seems OK after the restart, but I have two questions:
> - Is it OK that, when the time sync changes, all the nodes suddenly crash?
> - Is it OK that, after a Kudu service restart, the tablets go crazy and start sending a lot of data across the network, "syncing" up, which takes ~5-10 minutes (~800 tablets per tablet server)?
> Shouldn't the behaviour be that the Kudu tablet server waits a little during the restart and only then starts to replicate the data?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)