Posted to dev@zookeeper.apache.org by "Lukasz Osipiuk (JIRA)" <ji...@apache.org> on 2010/03/18 16:54:27 UTC

[jira] Created: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

zookeeper fails to start - broken snapshot?
-------------------------------------------

                 Key: ZOOKEEPER-713
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
             Project: Zookeeper
          Issue Type: Bug
    Affects Versions: 3.2.2
         Environment: debian lenny; ia64; xen virtualization
            Reporter: Lukasz Osipiuk
         Attachments: node1-zookeeper.log.gz, node2-zookeeper.log.gz, zoo.cfg

Hi guys,

The following is not so much a bug report as a question, but since I am attaching large files I am posting it here rather than on the mailing list.

Today we had a major failure in our production environment. The machines in the ZooKeeper cluster went wild and all clients got disconnected.
We tried to restart the whole ZooKeeper cluster, but it got stuck in the leader election phase.

Calling the stat command on any machine in the cluster resulted in a 'ZooKeeperServer not running' message.
In one of the logs I noticed an 'Invalid snapshot' message, which disturbed me a bit.

We did not manage to make the cluster work again with its data. After we deleted the version-2 directories on all nodes, the cluster started up without problems.
Is it possible that the snapshot/log data got corrupted in a way that made the cluster unable to start?
Fortunately we could rebuild the data we store in ZooKeeper, as we use it only for locks and most of the nodes are ephemeral.

I am attaching the contents of the version-2 directory from all nodes, as well as the server logs.
The original problem occurred some time before 15:00. The first cluster restart happened at 15:03.
At some point later we experimented with deleting the version-2 directory, so I would not look at the later restarts - they can be misleading due to our actions.

I am also attaching zoo.cfg. Maybe something is wrong there.
As I now look into the logs, I see a read timeout during the initialization phase after 20 seconds, which matches initLimit=10 * tickTime=2000 ms = 20 s.
Maybe all I have to do is increase one or the other - but which one? Are there any downsides to increasing tickTime?
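
For reference, the timeout-related part of the configuration looks roughly like this (a sketch only - host names, ports and paths are placeholders and syncLimit is assumed; the actual values are in the attached zoo.cfg):

    # excerpt from zoo.cfg (illustrative values)
    tickTime=2000                 # base time unit in milliseconds
    initLimit=10                  # followers get initLimit*tickTime = 20 s to connect and sync with the leader
    syncLimit=5                   # assumed here; ticks allowed for followers to stay in sync
    dataDir=/var/lib/zookeeper    # placeholder path
    clientPort=2181
    server.1=zk-node1:2888:3888   # placeholder host names
    server.2=zk-node2:2888:3888
    server.3=zk-node3:2888:3888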

Best regards, Łukasz Osipiuk





[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847407#action_12847407 ] 

Patrick Hunt commented on ZOOKEEPER-713:
----------------------------------------

Given that you are addressing some of the other issues we identified, 15 seconds should be enough. 5 might even be enough once things are running more smoothly; you can get a better sense using the 'stat' command, as I mentioned elsewhere.




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node3-version-2.tgz-aa



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node2-version-2.tgz-aa



[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846976#action_12846976 ] 

Benjamin Reed commented on ZOOKEEPER-713:
-----------------------------------------

Hey, just for a sanity check: you aren't running out of disk space, are you?
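
(A quick way to check is the free space on the volume holding the ZooKeeper data directory - the path below is a placeholder:)

    df -h /var/lib/zookeeper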



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node2-zookeeper.log.gz
                node1-zookeeper.log.gz
                zoo.cfg



[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847156#action_12847156 ] 

Patrick Hunt commented on ZOOKEEPER-713:
----------------------------------------

> I guess the JVM was swapping (as it was running out of memory), which caused delays with the file transfer. What do you think? Is it possible?

We did see cases in your logs where the init seemed to take a very short amount of time, and some cases where it took a very long time. It could be this, it could be virtualization, it could be intermittent network problems... any of those sound possible.

> One more question - is it normal that ZooKeeper consumed 0.5G of memory handling such a small snapshot?

If you look at that troubleshooting page you'll see a link to the latency overview http://bit.ly/4ekN8G - notice that it used a heap size of 512m with a 1.6.0_05 JVM. I was creating a large number of znodes and watches and it worked fine; in the 20-client case I was creating 200k znodes (20mb) and 1 million watches.

I don't know what might be taking the memory in your case (not knowing the use cases, znode count, average size, etc.), but you can get more insight using something like VisualVM or jconsole. Try attaching to the running VM and take a look; a couple of command-line starting points are sketched below.
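
For instance, with the standard JDK tools (the PID is a placeholder, and exact options vary with the JDK version):

    # GC and heap utilization, sampled every 5 seconds
    jstat -gcutil <zookeeper-pid> 5000
    # histogram of live objects, to see what is occupying the heap
    jmap -histo:live <zookeeper-pid>
    # or attach a GUI to the running JVM
    jconsole <zookeeper-pid>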

> I will consider building a newer Java deb package myself, but I would rather treat it as a last resort. Do you really think a newer Java version could help?

I wouldn't try that as a first option. It's something we noticed and wanted to mention in case it was easy to try.

Do take a look at the troubleshooting page - adding some monitoring of the JVM might help to provide insight, as would monitoring the parameters that ZooKeeper itself makes available through the command port and JMX. Some of the other issues listed there have been faced by more than one user and might be helpful.





[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node1-version-2.tgz-ab



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node1-version-2.tgz-aa



[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846990#action_12846990 ] 

Lukasz Osipiuk commented on ZOOKEEPER-713:
------------------------------------------

Nope - we have plenty of free space.



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node3-version-2.tgz-ab



[jira] Resolved: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed resolved ZOOKEEPER-713.
-------------------------------------

    Resolution: Invalid



[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847273#action_12847273 ] 

Lukasz Osipiuk commented on ZOOKEEPER-713:
------------------------------------------

We had a timeout of 5 seconds when these logs were written. I have already increased it to 15 seconds. We will see if that is enough.



[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847013#action_12847013 ] 

Benjamin Reed commented on ZOOKEEPER-713:
-----------------------------------------

Lukasz, thank you for reporting this problem and providing the details. We analyzed your logs and found the following problems:

1) You are getting an OutOfMemoryError during the snapshot; that is why you are seeing the invalid snapshot file. It turns out the invalid file isn't really a problem, since we can just recover from an older snapshot, but it may indicate that you are running very close to the heap limit and spending a lot of time in GC. (This may aggravate the next issue; a sketch of relevant JVM options follows after this list.)

2) The initLimit and tickTime in zoo.cfg may be too low. How much data is stored in ZooKeeper? Look at the snapshot file size: you need to be able to transmit the snapshot within initLimit*tickTime. As an experiment, try scp'ing the snapshots between the different servers and see how long it takes (also sketched below). tickTime should be increased on WAN connections. You might try doubling tickTime and initLimit. (If you really are overloading the GC, that is going to slow things down as well.)

3) We also noticed that it is taking a long time to read and write snapshots (~17 and ~40 seconds in some cases). Do you have other things contending for the disk? This affects how long it takes for the leader to respond to a client, and thus the initLimit.

4) We noticed that the JVM version you are using is pretty old. You may want to try upgrading to the latest version, especially since you are using 64-bit Linux.

5) Finally, it appears you are running under Xen. Some people have had issues running in virtualized environments, the main problems being memory and disk contention. If Java ever starts swapping, the GC can hang the process while pages go on and off the disk for reference checking. Disk contention will also slow down recovery and logging to disk during write operations.

We are opening a JIRA, ZOOKEEPER-714, related to 1), since we really should exit on OutOfMemoryError. In your specific case we recovered, but in general it doesn't seem wise to continue execution once an OutOfMemoryError has occurred.
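
A rough sketch of both experiments (host names, paths, file names and the heap size are placeholders, not recommendations, and the JVMFLAGS mechanism depends on how your zkServer.sh/packaging is set up):

    # 2) time a snapshot transfer between servers to sanity-check initLimit*tickTime
    #    (pick the latest snapshot.* file in your version-2 directory)
    time scp /var/lib/zookeeper/version-2/snapshot.<zxid> zk-node2:/tmp/

    # 1) give the JVM more headroom and dump the heap if it does run out of memory
    export JVMFLAGS="-Xmx1g -XX:+HeapDumpOnOutOfMemoryError"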



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment:     (was: node2-version-2.tgz-ab)



[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847019#action_12847019 ] 

Patrick Hunt commented on ZOOKEEPER-713:
----------------------------------------

Also see the troubleshooting page http://bit.ly/5WwS44 for details on issues people have faced in the past that might help you.



[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847251#action_12847251 ] 

Patrick Hunt commented on ZOOKEEPER-713:
----------------------------------------

One other thing that I thought of: what are you using for the client session timeout? You should probably go more toward a 30-second timeout rather than a 5-second one, given the timings we're seeing on the server (such as 7 seconds for session establishment). You can monitor the min/max/avg latency on your servers using the 'stat' command on the command port (a four-letter word). Using this information you can get a good idea of what would work (if latency exceeds the timeout, clients on that server will time out) and tune appropriately.
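
For example (the host is a placeholder; 2181 is the usual client port):

    # query server stats, including min/avg/max request latency, via the four-letter 'stat' command
    echo stat | nc zk-node1 2181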




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node2-version-2.tgz-ac



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node3-zookeeper.log.gz



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node2-version-2.tgz-aa



[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847014#action_12847014 ] 

Benjamin Reed commented on ZOOKEEPER-713:
-----------------------------------------

By the way, for 1) you may want to increase the amount of heap space assigned to the JVM to avoid the OutOfMemoryError (keeping in mind the issues in point 5: even without Xen you still need to make sure that the amount of memory allocated to the JVM is less than the physical memory available to the machine).
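
(As an illustrative sketch, not part of the original comment: the stock zkServer.sh honors a JVMFLAGS environment variable, and recent zkEnv.sh scripts source conf/java.env if it exists, so a heap limit can be set as shown below. The 1024m figure is an assumed example only; it must fit comfortably inside the memory actually given to the Xen guest so the JVM never swaps.)

{noformat}
# conf/java.env -- picked up by bin/zkEnv.sh at startup (sketch; the heap
# size below is an assumed example, not a sizing recommendation)
export JVMFLAGS="-Xms1024m -Xmx1024m"

# quick sanity check on the guest: heap plus OS overhead should stay well
# below the memory actually allocated to the domU
free -m
{noformat}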



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Description: 
Hi guys,

The following is not a bug report but rather a question - but as I am attaching large files I am posting it here rather than on mailinglist.

Today we had major failure in our production environment. Machines in zookeeper cluster gone wild and all clients got disconnected.
We tried to restart whole zookeeper cluster but cluster got stuck in leader election phase.

Calling stat command on any machine in the cluster resulted in 'ZooKeeperServer not running' message
In one of logs I noticed 'Invalid snapshot'  message which disturbed me a bit.

We did not manage to make cluster work again with data. We deleted all version-2 directories on all nodes and then cluster started up without problems.
Is it possible that snapshot/log data got corrupted in a way which made cluster unable to start?
Fortunately we could rebuild data we store in zookeeper as we use it only for locks and most of nodes is ephemeral.

I am attaching contents of version-2 directory from all nodes and server logs.
Source problem occurred some time before 15. First cluster restart happened at 15:03.
At some point later we experimented with deleting version-2 directory so I would not look at following restart because they can be misleading due to our actions.

I am also attaching zoo.cfg. Maybe something is wrong at this place. 
As I know look into logs i see read timeout during initialization phase after 20secs (initLimit=10, tickTime=2000).
Maybe all I have to do is increase one or other. which one? Are there any downsides of increasing tickTime.

Best regards, Łukasz Osipiuk

PS. due to attachment size limit I used split. to untar use 
cat nodeX-version-2.tgz-* |tar -xz




[jira] Closed: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt closed ZOOKEEPER-713.
----------------------------------




[jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847124#action_12847124 ] 

Lukasz Osipiuk commented on ZOOKEEPER-713:
------------------------------------------

Thanks for the detailed answer.

{quote}
lukasz thank you for reporting this problem and providing the details. we analyzed your logs and found the following problems:

1) you are getting an OutOfMemoryError during the snapshot. that is why you are getting the invalid snapshot file. it turns out that the invalid file isn't really a problem since we can just use an older snapshot to recover from, but this may indicate that you are running very close to the limit and spending a lot of time in the GC. (this may aggravate the next issue.)
{quote}

It appears we did not adjust the Java heap size and we are running with 512M of memory. I will increase the heap size tomorrow.

{quote}
2) the initLimit and ticktime in zoo.cfg may be too low. how much data is stored in zookeeper? look at the snapshot filesize, you need to be able to transmit the snapshot within the initLimit*ticktime. as an experiment try scping the snapshots between the different servers and see how long it takes. ticktime should be increased on wan connections. you might try doubling the ticktime and initLimit. (if you really are overloading the GC it is going to be slowing things down.)
{quote}
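
(Illustrative sketch, not part of the original exchange: a follower has to connect to the leader and finish syncing within initLimit * tickTime, so the attached zoo.cfg allows 10 * 2000 ms = 20 s, while doubling both values as suggested above allows 20 * 4000 ms = 80 s. One side effect to keep in mind: by default the negotiable session timeout range is 2 to 20 ticks, so raising tickTime also raises client session timeouts.)

{noformat}
# zoo.cfg (sketch of the doubled values only; all other settings unchanged)
tickTime=4000   # was 2000 -- length of one tick in milliseconds
initLimit=20    # was 10   -- follower must connect/sync within 20 * 4000 ms = 80 s
{noformat}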

Our snapshot is quite small - at the time the problem occurred it was around 11M. I checked with scp, and copying it takes much less than a second.
I guess the JVM swapping (as it was running out of memory) caused the delays with the file transfer. What do you think? Is that possible?

One more question: is it normal that ZooKeeper consumed 0.5G of memory while handling such a small snapshot?

{quote}
3) we also noticed that it is taking a long time to read and write snapshots (~17 and ~40 seconds in some cases). do you have other things contending with the disk? this is going to affect how long it takes for the leader to respond to a client, and thus the initLimit.
{quote}

Same as above - maybe the memory limit was the problem. Anyway, I will move the ZooKeeper guests to Xen hosts that do not have any disk-intensive guests, to limit disk contention; hopefully this helps. The Xen hosts that currently run ZooKeeper also run some MySQL instances. They are not used very intensively, but it is still worth fixing.
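
(A further illustrative sketch, not something proposed in the thread: if disk contention keeps being an issue, the transaction log can be placed on its own device via the dataLogDir setting, separate from the snapshot directory in dataDir. Both paths below are hypothetical.)

{noformat}
# zoo.cfg (sketch; both paths are made-up examples)
dataDir=/var/lib/zookeeper      # snapshots
dataLogDir=/mnt/zk-txnlog       # transaction log on a dedicated, uncontended disk
{noformat}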

{quote}
4) we noticed that the jvm version you are using is pretty old. you may try upgrading to the latest version, especially since you are using 64-bit linux.
{quote}

Well - Debian stable and "up to date" do not go well together. I will consider building a newer Java deb package myself, but I would rather treat that as a last resort. Do you really think a newer Java version could help?




[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node3-version-2.tgz-ac



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment:     (was: node2-version-2.tgz-aa)



[jira] Updated: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?

Posted by "Lukasz Osipiuk (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukasz Osipiuk updated ZOOKEEPER-713:
-------------------------------------

    Attachment: node2-version-2.tgz-ab
