Posted to reviews@mesos.apache.org by Andrew Schwartzmeyer <an...@schwartzmeyer.com> on 2018/06/13 22:30:25 UTC

Review Request 67587: Updated ZooKeeper retry logic to retry on `ENOENT` too.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67587/
-----------------------------------------------------------

Review request for mesos, Joseph Wu and Neil Conway.


Bugs: MESOS-3790
    https://issues.apache.org/jira/browse/MESOS-3790


Repository: mesos


Description
-------

Per MESOS-3790, `zookeeper_init` maps the `getaddrinfo` errors
`EAI_NONAME` and `EAI_NODATA` to an `errno` value of `ENOENT`, and all
others except `EAI_MEMORY` to `EINVAL`. Mesos's ZooKeeper logic retries
this initialization for ten minutes if the error is `EINVAL`, and
should be updated to also retry if the error is `ENOENT`.
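
The mapping and the retry decision can be sketched as follows. This is a minimal C++ sketch, not the ZooKeeper client's or Mesos's actual code: `gaiToErrno` and `isRetryableInitError` are hypothetical helpers, and mapping `EAI_MEMORY` to `ENOMEM` is an assumption (the description above only says it is excluded from `EINVAL`).

```cpp
#include <cerrno>
#include <netdb.h>

// Hypothetical helper illustrating the getaddrinfo-to-errno mapping
// described in MESOS-3790. The real mapping lives inside the ZooKeeper
// C client; EAI_MEMORY -> ENOMEM is an assumption of this sketch.
int gaiToErrno(int rc) {
  switch (rc) {
    case EAI_NONAME:
#if defined(EAI_NODATA) && EAI_NODATA != EAI_NONAME
    // Some platforms do not define EAI_NODATA, or alias it to EAI_NONAME.
    case EAI_NODATA:
#endif
      return ENOENT;
    case EAI_MEMORY:
      return ENOMEM;
    default:
      // Every other resolver error collapses into EINVAL.
      return EINVAL;
  }
}

// Hypothetical predicate: before the patch only EINVAL was treated as
// retryable; the patch adds ENOENT (name resolution failures).
bool isRetryableInitError(int err) {
  return err == EINVAL || err == ENOENT;
}
```

With this predicate, a transient DNS failure (`EAI_NONAME`, hence `ENOENT`) is retried just like the other resolver errors, while `ENOMEM` is still treated as fatal.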

This is necessary because, if the initialization is not retried, the
process crashes at the `PLOG(FATAL)` call. That crash interrupts other
Mesos threads and can leave the environment in an unknown state. For
instance, we have seen intermittent failures where the systemd unit
file `mesos_executors.slice` is created but left empty because Mesos
crashed between creating the file and flushing the write to it. This
then causes errors when the agent is restarted (and succeeds in
connecting to ZooKeeper), because the agent explicitly does not
attempt to write the unit file if it already exists.
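
A bounded retry loop of the kind described above, which gives up only after a deadline instead of crashing on the first `ENOENT`, might look like the following. This is a minimal C++ sketch under stated assumptions: `InitFn` is a hypothetical stand-in for `zookeeper_init`, `initWithRetry` is a hypothetical helper, and the timeout and backoff are parameters rather than Mesos's actual values.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for zookeeper_init: returns a non-null handle on
// success, or nullptr with errno set on failure.
using InitFn = std::function<void*()>;

// Retries `init` while it fails with a retryable errno (EINVAL or ENOENT).
// Returns the handle on success. Returns nullptr (with errno preserved) if
// the error is not retryable or if sleeping for `backoff` would reach the
// deadline; the caller decides whether that outcome is fatal.
void* initWithRetry(const InitFn& init,
                    std::chrono::steady_clock::duration timeout,
                    std::chrono::steady_clock::duration backoff) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    errno = 0;
    void* handle = init();
    if (handle != nullptr) {
      return handle;
    }
    const int err = errno;
    const bool retryable = (err == EINVAL || err == ENOENT);
    if (!retryable ||
        std::chrono::steady_clock::now() + backoff >= deadline) {
      errno = err;  // Preserve the failure reason for the caller.
      return nullptr;
    }
    std::this_thread::sleep_for(backoff);
  }
}
```

In this sketch a caller would pass a ten-minute timeout, matching the retry window mentioned above, and keep its fatal log only for the `nullptr` case, so a transient name-resolution failure no longer terminates the process on the first attempt.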


Diffs
-----

  src/zookeeper/zookeeper.cpp 52c4af192ccd1361afc4f7a0041889238c01e674 


Diff: https://reviews.apache.org/r/67587/diff/1/


Testing
-------

Testing against our repro right now, but the repro is flaky, so it'll take a while.


Thanks,

Andrew Schwartzmeyer


Re: Review Request 67587: Updated ZooKeeper retry logic to retry on `ENOENT` too.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67587/#review204749
-----------------------------------------------------------



PASS: Mesos patch 67587 was successfully built and tested.

Reviews applied: `['67587']`

All the build artifacts are available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67587

- Mesos Reviewbot Windows




Re: Review Request 67587: Updated ZooKeeper retry logic to retry on `ENOENT` too.

Posted by Akash Gupta <ak...@hotmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67587/#review205027
-----------------------------------------------------------


Ship it!


- Akash Gupta

