You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Qian Zhang <zh...@gmail.com> on 2019/02/13 08:26:43 UTC

Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/
-----------------------------------------------------------

Review request for mesos, Andrei Budnik and Gilbert Song.


Bugs: MESOS-9507
    https://issues.apache.org/jira/browse/MESOS-9507


Repository: mesos


Description
-------

There are two cases we need to handle:
  1. The checkpointed docker volumes file does not exist.
  2. The checkpointed docker volumes file is empty.
For both of the two cases, in the recovery of `docker/volume` isolator,
we should remove the container's checkpoint directory and then skip the
container.


Diffs
-----

  src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 


Diff: https://reviews.apache.org/r/69972/diff/1/


Testing
-------


Thanks,

Qian Zhang


Re: Review Request 69972: Handled containers which has no checkpointed volumes during recovery.

Posted by Qian Zhang <zh...@gmail.com>.

> On Feb. 18, 2019, 11:48 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp
> > Line 92 (original), 92 (patched)
> > <https://reviews.apache.org/r/69972/diff/3/?file=2125760#file2125760line92>
> >
> >     `orphaned` sounds more straightforward to me than `recoverable`. Especially, when you are not familiar with our containerization subsystem.

Had a discussion with Gilbert, we agree we'd better not to differentiate recoverable containers and orphaned containers in this patch. What you pointed out (recoverable containers must not have empty checkpoint file) actualy exists in other isolators as well (e.g., CNI), we may consider to improve them in a separate patch in future.


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212902
-----------------------------------------------------------


On Feb. 20, 2019, 2:17 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 2:17 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The container's checkpointed docker volumes file does not exist.
>   2. The container's checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should construct an info object with empty docker volumes for the
> container and rely on containerizer or `docker/volume` isolator's
> `recover` method to cleanup the container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/4/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Skipped orphan container which has no checkpointed volumes in recovery.

Posted by Andrei Budnik <ab...@mesosphere.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212902
-----------------------------------------------------------




src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp
Line 92 (original), 92 (patched)
<https://reviews.apache.org/r/69972/#comment298782>

    `orphaned` sounds more straightforward to me than `recoverable`. Especially, when you are not familiar with our containerization subsystem.


- Andrei Budnik


On Feb. 18, 2019, 2:31 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 18, 2019, 2:31 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the orphan container's checkpoint directory and then
> skip the container. But for recoverable containers, we return an error
> since the above two cases should not happen for them.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp 2fd0493bc3e1749c12b81b9f2d281b86cb2c410f 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/3/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Handled containers which has no checkpointed volumes during recovery.

Posted by Andrei Budnik <ab...@mesosphere.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212956
-----------------------------------------------------------


Ship it!




Ship It!

- Andrei Budnik


On Feb. 20, 2019, 6:17 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 6:17 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The container's checkpointed docker volumes file does not exist.
>   2. The container's checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should construct an info object with empty docker volumes for the
> container and rely on containerizer or `docker/volume` isolator's
> `recover` method to cleanup the container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/4/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Handled containers which has no checkpointed volumes during recovery.

Posted by Gilbert Song <so...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212982
-----------------------------------------------------------


Ship it!




Ship It!

- Gilbert Song


On Feb. 19, 2019, 10:17 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 19, 2019, 10:17 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The container's checkpointed docker volumes file does not exist.
>   2. The container's checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should construct an info object with empty docker volumes for the
> container and rely on containerizer or `docker/volume` isolator's
> `recover` method to cleanup the container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/4/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Handled containers which has no checkpointed volumes during recovery.

Posted by Qian Zhang <zh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/
-----------------------------------------------------------

(Updated Feb. 20, 2019, 2:17 p.m.)


Review request for mesos, Andrei Budnik and Gilbert Song.


Changes
-------

Reused `cleanup` to remove container directory rather than doing it directly in `_recover`.


Summary (updated)
-----------------

Handled containers which has no checkpointed volumes during recovery.


Bugs: MESOS-9507
    https://issues.apache.org/jira/browse/MESOS-9507


Repository: mesos


Description (updated)
-------

There are two cases we need to handle:
  1. The container's checkpointed docker volumes file does not exist.
  2. The container's checkpointed docker volumes file is empty.
For both of the two cases, in the recovery of `docker/volume` isolator,
we should construct an info object with empty docker volumes for the
container and rely on containerizer or `docker/volume` isolator's
`recover` method to cleanup the container.


Diffs (updated)
-----

  src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 


Diff: https://reviews.apache.org/r/69972/diff/4/

Changes: https://reviews.apache.org/r/69972/diff/3-4/


Testing
-------


Thanks,

Qian Zhang


Re: Review Request 69972: Skipped orphan container which has no checkpointed volumes in recovery.

Posted by Qian Zhang <zh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/
-----------------------------------------------------------

(Updated Feb. 18, 2019, 10:31 p.m.)


Review request for mesos, Andrei Budnik and Gilbert Song.


Changes
-------

Minor changes.


Bugs: MESOS-9507
    https://issues.apache.org/jira/browse/MESOS-9507


Repository: mesos


Description
-------

There are two cases we need to handle:
  1. The checkpointed docker volumes file does not exist.
  2. The checkpointed docker volumes file is empty.
For both of the two cases, in the recovery of `docker/volume` isolator,
we should remove the orphan container's checkpoint directory and then
skip the container. But for recoverable containers, we return an error
since the above two cases should not happen for them.


Diffs (updated)
-----

  src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp 2fd0493bc3e1749c12b81b9f2d281b86cb2c410f 
  src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 


Diff: https://reviews.apache.org/r/69972/diff/3/

Changes: https://reviews.apache.org/r/69972/diff/2-3/


Testing
-------


Thanks,

Qian Zhang


Re: Review Request 69972: Skipped orphan container which has no checkpointed volumes in recovery.

Posted by Qian Zhang <zh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/
-----------------------------------------------------------

(Updated Feb. 18, 2019, 10:26 p.m.)


Review request for mesos, Andrei Budnik and Gilbert Song.


Changes
-------

Addressed review comments.


Summary (updated)
-----------------

Skipped orphan container which has no checkpointed volumes in recovery.


Bugs: MESOS-9507
    https://issues.apache.org/jira/browse/MESOS-9507


Repository: mesos


Description (updated)
-------

There are two cases we need to handle:
  1. The checkpointed docker volumes file does not exist.
  2. The checkpointed docker volumes file is empty.
For both of the two cases, in the recovery of `docker/volume` isolator,
we should remove the orphan container's checkpoint directory and then
skip the container. But for recoverable containers, we return an error
since the above two cases should not happen for them.


Diffs (updated)
-----

  src/slave/containerizer/mesos/isolators/docker/volume/isolator.hpp 2fd0493bc3e1749c12b81b9f2d281b86cb2c410f 
  src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 


Diff: https://reviews.apache.org/r/69972/diff/2/

Changes: https://reviews.apache.org/r/69972/diff/1-2/


Testing
-------


Thanks,

Qian Zhang


Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212791
-----------------------------------------------------------



PASS: Mesos patch 69972 was successfully built and tested.

Reviews applied: `['69972']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2884/mesos-review-69972

- Mesos Reviewbot Windows


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

Posted by Andrei Budnik <ab...@mesosphere.com>.

> On Feb. 13, 2019, 1:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > <https://reviews.apache.org/r/69972/diff/1/?file=2125066#file2125066line300>
> >
> >     Given that `state::checkpoint` is **atomic**, we can not end up in the state where the file is empty because the agent did not finish writing to it.
> >     
> >     However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.
> >     
> >     As we have agreed with Gilbert, we need to ignore empty files **only** in case of orphan containers.
> 
> Qian Zhang wrote:
>     > Given that state::checkpoint is atomic, we can not end up in the state where the file is empty because the agent did not finish writing to it.
>     > However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.
>     
>     Agree. And I see the comment `// This could happen if the slave died after opening the file for writing but before it checkpointed anything.` in a couple of places in Mesos code (e.g., `slave/state.cpp`, `metadata_manager.cpp`), I think those comments need to be updated as well.
>     
>     > As we have agreed with Gilbert, we need to ignore empty files only in case of orphan containers.
>     
>     Can you please elaborate a bit? Why do we want to treat orphan containers and recoverable containers differently? How will we handle the recoverable containers in this case?

> I think those comments need to be updated as well.

We can update these comments in a separate patch later.

> Why do we want to treat orphan containers and recoverable containers differently? How will we handle the recoverable containers in this case?

I think that `recoverable` containers could not have empty state files by construction as checkpointing state is atomic. If this invariant does not satisfy, then we definitely have a bug in our code which we need to fix.
In the case of hard reboots, this invariant might be broken, but all containers are `orphan`.

If the reasoning above looks acceptable, then we might want to recover after broken invariant for `orphan` containers, while keeping this error for `non-orphan` containers as an assertion (for us, developers) that the invariant could not be broken in normal circumstances.


- Andrei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
-----------------------------------------------------------


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

Posted by Qian Zhang <zh...@gmail.com>.

> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > <https://reviews.apache.org/r/69972/diff/1/?file=2125066#file2125066line300>
> >
> >     Given that `state::checkpoint` is **atomic**, we can not end up in the state where the file is empty because the agent did not finish writing to it.
> >     
> >     However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.
> >     
> >     As we have agreed with Gilbert, we need to ignore empty files **only** in case of orphan containers.
> 
> Qian Zhang wrote:
>     > Given that state::checkpoint is atomic, we can not end up in the state where the file is empty because the agent did not finish writing to it.
>     > However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.
>     
>     Agree. And I see the comment `// This could happen if the slave died after opening the file for writing but before it checkpointed anything.` in a couple of places in Mesos code (e.g., `slave/state.cpp`, `metadata_manager.cpp`), I think those comments need to be updated as well.
>     
>     > As we have agreed with Gilbert, we need to ignore empty files only in case of orphan containers.
>     
>     Can you please elaborate a bit? Why do we want to treat orphan containers and recoverable containers differently? How will we handle the recoverable containers in this case?
> 
> Andrei Budnik wrote:
>     > I think those comments need to be updated as well.
>     
>     We can update these comments in a separate patch later.
>     
>     > Why do we want to treat orphan containers and recoverable containers differently? How will we handle the recoverable containers in this case?
>     
>     I think that `recoverable` containers could not have empty state files by construction as checkpointing state is atomic. If this invariant does not satisfy, then we definitely have a bug in our code which we need to fix.
>     In the case of hard reboots, this invariant might be broken, but all containers are `orphan`.
>     
>     If the reasoning above looks acceptable, then we might want to recover after broken invariant for `orphan` containers, while keeping this error for `non-orphan` containers as an assertion (for us, developers) that the invariant could not be broken in normal circumstances.

Agree, and I posted a patch for updating comments here: https://reviews.apache.org/r/70001/


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
-----------------------------------------------------------


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

Posted by Qian Zhang <zh...@gmail.com>.

> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > Thanks for the patch!
> > I think we should implement a test for this. Otherwise, it would be very dangerous to _refactor_ this part of code in the future.
> > If you have no chance to implement a test for this now, please feel free to file a ticket.

Sure, I posted a unit test here: https://reviews.apache.org/r/69994/


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
-----------------------------------------------------------


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

Posted by Qian Zhang <zh...@gmail.com>.

> On Feb. 13, 2019, 9:28 p.m., Andrei Budnik wrote:
> > src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
> > Lines 300 (patched)
> > <https://reviews.apache.org/r/69972/diff/1/?file=2125066#file2125066line300>
> >
> >     Given that `state::checkpoint` is **atomic**, we can not end up in the state where the file is empty because the agent did not finish writing to it.
> >     
> >     However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.
> >     
> >     As we have agreed with Gilbert, we need to ignore empty files **only** in case of orphan containers.

> Given that state::checkpoint is atomic, we can not end up in the state where the file is empty because the agent did not finish writing to it.
> However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.

Agree. And I see the comment `// This could happen if the slave died after opening the file for writing but before it checkpointed anything.` in a couple of places in Mesos code (e.g., `slave/state.cpp`, `metadata_manager.cpp`), I think those comments need to be updated as well.

> As we have agreed with Gilbert, we need to ignore empty files only in case of orphan containers.

Can you please elaborate a bit? Why do we want to treat orphan containers and recoverable containers differently? How will we handle the recoverable containers in this case?


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
-----------------------------------------------------------


On Feb. 13, 2019, 4:26 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 4:26 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

Posted by Andrei Budnik <ab...@mesosphere.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212795
-----------------------------------------------------------



Thanks for the patch!
I think we should implement a test for this. Otherwise, it would be very dangerous to _refactor_ this part of code in the future.
If you have no chance to implement a test for this now, please feel free to file a ticket.


src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp
Lines 300 (patched)
<https://reviews.apache.org/r/69972/#comment298698>

    Given that `state::checkpoint` is **atomic**, we can not end up in the state where the file is empty because the agent did not finish writing to it.
    
    However, an empty file might occur in case of hard reboot of the agent's host. This happens because page cache is dumped every 20 seconds by default in Linux. There is a chance that the file is created, but data has not yet synced on disk.
    
    As we have agreed with Gilbert, we need to ignore empty files **only** in case of orphan containers.


- Andrei Budnik


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69972: Skipped the container which has no checkpointed volumes during recovery.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69972/#review212803
-----------------------------------------------------------



Patch looks great!

Reviews applied: [69972]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Feb. 13, 2019, 8:26 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69972/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 8:26 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9507
>     https://issues.apache.org/jira/browse/MESOS-9507
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> There are two cases we need to handle:
>   1. The checkpointed docker volumes file does not exist.
>   2. The checkpointed docker volumes file is empty.
> For both of the two cases, in the recovery of `docker/volume` isolator,
> we should remove the container's checkpoint directory and then skip the
> container.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp a72fc84da6fb0f24d363dd4c635500510da675d8 
> 
> 
> Diff: https://reviews.apache.org/r/69972/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>