You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Lukasz Osipiuk (JIRA)" <ji...@apache.org> on 2010/03/18 16:54:27 UTC

[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node2-zookeeper.log.gz
                node1-zookeeper.log.gz
                zoo.cfg

> zookeeper fails to start - broken snapshot?
> -------------------------------------------
>
>                 Key: ZOOKEEPER-713
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>         Environment: debian lenny; ia64; xen virtualization
>            Reporter: Lukasz Osipiuk
>         Attachments: node1-zookeeper.log.gz, node2-zookeeper.log.gz, zoo.cfg
>
>
> Hi guys,
> The following is not a bug report but rather a question - but as I am attaching large files I am posting it here rather than on mailinglist.
> Today we had major failure in our production environment. Machines in zookeeper cluster gone wild and all clients got disconnected.
> We tried to restart whole zookeeper cluster but cluster got stuck in leader election phase.
> Calling stat command on any machine in the cluster resulted in 'ZooKeeperServer not running' message
> In one of logs I noticed 'Invalid snapshot'  message which disturbed me a bit.
> We did not manage to make cluster work again with data. We deleted all version-2 directories on all nodes and then cluster started up without problems.
> Is it possible that snapshot/log data got corrupted in a way which made cluster unable to start?
> Fortunately we could rebuild data we store in zookeeper as we use it only for locks and most of nodes is ephemeral.
> I am attaching contents of version-2 directory from all nodes and server logs.
> Source problem occurred some time before 15. First cluster restart happened at 15:03.
> At some point later we experimented with deleting version-2 directory so I would not look at following restart because they can be misleading due to our actions.
> I am also attaching zoo.cfg. Maybe something is wrong at this place. 
> As I know look into logs i see read timeout during initialization phase after 20secs (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or other. which one? Are there any downsides of increasing tickTime.
> Best regards, Ɓukasz Osipiuk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.