Posted to dev@zookeeper.apache.org by "Lukasz Osipiuk (JIRA)" <ji...@apache.org> on 2010/03/18 16:54:27 UTC
[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?
[ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------
Attachment: node2-zookeeper.log.gz
node1-zookeeper.log.gz
zoo.cfg
> zookeeper fails to start - broken snapshot?
> -------------------------------------------
>
> Key: ZOOKEEPER-713
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
> Project: Zookeeper
> Issue Type: Bug
> Affects Versions: 3.2.2
> Environment: debian lenny; ia64; xen virtualization
> Reporter: Lukasz Osipiuk
> Attachments: node1-zookeeper.log.gz, node2-zookeeper.log.gz, zoo.cfg
>
>
> Hi guys,
> The following is not a bug report but rather a question; since I am attaching large files, I am posting it here rather than on the mailing list.
> Today we had a major failure in our production environment. The machines in the ZooKeeper cluster went wild and all clients got disconnected.
> We tried to restart the whole ZooKeeper cluster, but it got stuck in the leader election phase.
> Calling the stat command on any machine in the cluster resulted in a 'ZooKeeperServer not running' message.
> In one of the logs I noticed an 'Invalid snapshot' message, which disturbed me a bit.
> We did not manage to get the cluster working again with its data. We deleted all version-2 directories on all nodes, and then the cluster started up without problems.
> Is it possible that the snapshot/log data got corrupted in a way that made the cluster unable to start?
> Fortunately we could rebuild the data we store in ZooKeeper, as we use it only for locks and most of the nodes are ephemeral.
> I am attaching the contents of the version-2 directory from all nodes and the server logs.
> The original problem occurred some time before 15:00. The first cluster restart happened at 15:03.
> At some point later we experimented with deleting the version-2 directory, so I would not look at the subsequent restarts, because they can be misleading due to our actions.
> I am also attaching zoo.cfg. Maybe something is wrong there.
> As I now look into the logs, I see a read timeout during the initialization phase after 20 seconds (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or the other. Which one? Are there any downsides to increasing tickTime?
> Best regards, Łukasz Osipiuk
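
The 20-second figure in the message above follows from the two settings it quotes: initLimit is counted in ticks, and tickTime is the length of one tick in milliseconds, so the initialization budget is their product. The sketch below only reproduces that arithmetic. The function name and the two alternative settings are hypothetical and chosen for illustration; they are not values from the attached zoo.cfg and not a recommendation.

```python
def init_timeout_ms(init_limit: int, tick_time_ms: int) -> int:
    """Initialization budget in milliseconds: initLimit ticks of tickTime each."""
    return init_limit * tick_time_ms


# Values quoted in the message: initLimit=10, tickTime=2000.
reported = init_timeout_ms(init_limit=10, tick_time_ms=2000)
print(f"reported settings: {reported / 1000:.0f} s")

# Hypothetical alternatives (assumptions, not taken from the report).
# Both reach the same initialization budget in different ways.
more_ticks = init_timeout_ms(init_limit=30, tick_time_ms=2000)
longer_tick = init_timeout_ms(init_limit=10, tick_time_ms=6000)
print(f"initLimit=30, tickTime=2000: {more_ticks / 1000:.0f} s")
print(f"initLimit=10, tickTime=6000: {longer_tick / 1000:.0f} s")
```

Both alternatives yield the same budget, but they are not equivalent in effect: initLimit applies only to the initial connect-and-sync phase, whereas tickTime is the base unit for other tick-based settings as well (syncLimit, for example), so raising it stretches those too.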
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.