You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@mesos.apache.org by Chun-Hung Hsiao <ch...@apache.org> on 2018/10/12 23:20:00 UTC

Review Request 69010: Synced SLRP checkpoints to the filesystem.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/
-----------------------------------------------------------

Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Bugs: MESOS-9281
    https://issues.apache.org/jira/browse/MESOS-9281


Repository: mesos


Description
-------

Currently if a system crashes, SLRP checkpoints might not be synced to
the filesystem, so it is possible that an old or empty checkpoint will
be read upon recovery. Moreover, if a CSI call has been issued right
before the crash, the recovered state may be inconsistent with the
actual state reported by the plugin. For example, the plugin might have
created a volume but the checkpointed state does not know about it.

To avoid this inconsistency, we always call fsync()  when checkpointing
SLRP states.


Diffs
-----

  src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
  src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 


Diff: https://reviews.apache.org/r/69010/diff/1/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.


> On Oct. 13, 2018, 2:49 a.m., James DeFelice wrote:
> > src/slave/state.hpp
> > Line 192 (original), 196 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096858#file2096858line196>
> >
> >     Why is the default `false` here? If someone is calling the `checkpoint` func because they want atomic semantics ... wouldn't they *normally* want `sync==true` (and shouldn't the *exception* to the rule be `sync==false`)?
> 
> Chun-Hung Hsiao wrote:
>     This is to maintain the same semantics for original uses of this function. IIRC most of the components are able to handle stale or empty checkpoints. Also defaulting the flag to true will bring in some performance impact.

Some numbers for fsync found on the Internet: https://gist.github.com/prashanthpai/e246be62656f25d7e31b
For now I'll just leave a todo for enabling syncing by default. Does this sound okay to you?


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209512
-----------------------------------------------------------


On Oct. 16, 2018, 2:48 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 2:48 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.


> On Oct. 13, 2018, 2:49 a.m., James DeFelice wrote:
> > src/slave/state.hpp
> > Line 192 (original), 196 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096858#file2096858line196>
> >
> >     Why is the default `false` here? If someone is calling the `checkpoint` func because they want atomic semantics ... wouldn't they *normally* want `sync==true` (and shouldn't the *exception* to the rule be `sync==false`)?

This is to maintain the same semantics for original uses of this function. IIRC most of the components are able to handle stale or empty checkpoints. Also defaulting the flag to true will bring in some performance impact.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209512
-----------------------------------------------------------


On Oct. 12, 2018, 11:19 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2018, 11:19 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by James DeFelice <ja...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209512
-----------------------------------------------------------




src/slave/state.hpp
Line 192 (original), 196 (patched)
<https://reviews.apache.org/r/69010/#comment293974>

    Why is the default `false` here? If someone is calling the `checkpoint` func because they want atomic semantics ... wouldn't they *normally* want `sync==true` (and shouldn't the *exception* to the rule be `sync==false`)?


- James DeFelice


On Oct. 12, 2018, 11:19 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2018, 11:19 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209579
-----------------------------------------------------------



PASS: Mesos patch 69010 was successfully built and tested.

Reviews applied: `['69009', '69010']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2466/mesos-review-69010

- Mesos Reviewbot Windows


On Oct. 15, 2018, 10:08 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2018, 10:08 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209586
-----------------------------------------------------------



PASS: Mesos patch 69010 was successfully built and tested.

Reviews applied: `['69009', '69010']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2467/mesos-review-69010

- Mesos Reviewbot Windows


On Oct. 16, 2018, 2:48 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 2:48 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Benjamin Bannier <be...@mesosphere.io>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209691
-----------------------------------------------------------


Fix it, then Ship it!





src/resource_provider/storage/provider.cpp
Line 1176 (original), 1189 (patched)
<https://reviews.apache.org/r/69010/#comment294246>

    Any reason we don't just store the `uuid` values in a container and directly garbage collect them after checkpointing without going through (potentially multiple) dispatch calls? I personally would prefer to handle this synchronously.


- Benjamin Bannier


On Oct. 16, 2018, 6:46 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 6:46 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/4/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209720
-----------------------------------------------------------



PASS: Mesos patch 69010 was successfully built and tested.

Reviews applied: `['69009', '69010']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2481/mesos-review-69010

- Mesos Reviewbot Windows


On Oct. 17, 2018, 8:30 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2018, 8:30 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/5/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Benjamin Bannier <be...@mesosphere.io>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209982
-----------------------------------------------------------


Ship it!




Ship It!

- Benjamin Bannier


On Oct. 18, 2018, 4:01 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 18, 2018, 4:01 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/6/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review210216
-----------------------------------------------------------



Patch looks great!

Reviews applied: [69009, 69085, 69010]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Oct. 30, 2018, 9:08 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 30, 2018, 9:08 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review210200
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['69009', '69085', '69010']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2552/mesos-review-69010

Relevant logs:

- [mesos-tests.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2552/mesos-review-69010/logs/mesos-tests.log):

```
I1030 23:22:25.074347  4972 executor.cpp:918] Sending SIGTERM to process tree at pid 710ing shutdown framework 4ab0579b-d229-4918-86e3-d71a8f53c9ef-0000 because it is terminating
I1030 23:22:25.076347  1988 master.cpp:1273] Agent 4ab0579b-d229-4918-86e3-d71a8f53c9ef-S0 at slave(461)@192.10.1.7:51130 (windows-04.aa0q4n2kgcyefckmv0xukjvy4f.xx.internal.cloudapp.net) disconnected
I1030 23:22:25.076347  1988 master.cpp:3289] Disconnecting agent 4ab0579b-d229-4918-86e3-d71a8f53c9ef-S0 at slave(461)@192.10.1.7:51130 (windows-04.aa0q4n2kgcyefckmv0xukjvy4f.xx.internal.cloudapp.net)
I1030 23:22:25.076347  1988 master.cpp:3308] Deactivating agent 4ab0579b-d229-4918-86e3-d71a8f53c9ef-S0 at slave(461)@192.10.1.7:51130 (windows-04.aa0q4n2kgcyefckmv0xukjvy4f.xx.internal.cloudapp.net)
I1030 23:22:25.076347  7404 hierarchical.cpp:357] Removed framework 4ab0579b-d229-4918-86e3-d71a8f53c9ef-0000
I1030 23:22:25.077354  7404 hierarchical.cpp:801] Agent 4ab0579b-d229-4918-86e3-d71a8f53c9ef-S0 deactivated
I1030 23:22:25.078361  6516 containerizer.cpp:2455] Destroying container b06dea4d-5f82-4a26-8d1f-8cac5d1b8a74 in R[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (588 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (606 ms total)

[----------] Global test environment tear-down
[==========] 1053 tests from 103 test cases ran. (522189 ms total)
[  PASSED  ] 1052 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DockerFetcherPluginTest.INTERNET_CURL_InvokeFetchByName

 1 FAILED TEST
  YOU HAVE 231 DISABLED TESTS

UNNING state
I1030 23:22:25.078361  6516 containerizer.cpp:3122] Transitioning the state of container b06dea4d-5f82-4a26-8d1f-8cac5d1b8a74 from RUNNING to DESTROYING
I1030 23:22:25.078361  6516 launcher.cpp:166] Asked to destroy container b06dea4d-5f82-4a26-8d1f-8cac5d1b8a74
W1030 23:22:25.079360  1520 process.cpp:1423] Failed to recv on socket WindowsFD::Type::SOCKET=984 to peer '192.10.1.7:52974': IO failed with error code: The specified network name is no longer available.

W1030 23:22:25.080353  1520 process.cpp:838] Failed to recv on socket WindowsFD::Type::SOCKET=2968 to peer '192.10.1.7:52975': IO failed with error code: The specified network name is no longer available.

I1030 23:22:25.111367  5096 containerizer.cpp:2961] Container b06dea4d-5f82-4a26-8d1f-8cac5d1b8a74 has exited
I1030 23:22:25.140349  7296 master.cpp:1115] Master terminating
I1030 23:22:25.141355  6516 hierarchical.cpp:643] Removed agent 4ab0579b-d229-4918-86e3-d71a8f53c9ef-S0
I1030 23:22:25.381356  1520 process.cpp:927] Stopped the socket accept loop
```

- Mesos Reviewbot Windows


On Oct. 30, 2018, 2:08 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 30, 2018, 2:08 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/7/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review210238
-----------------------------------------------------------



PASS: Mesos patch 69010 was successfully built and tested.

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2558/mesos-review-69010

- Mesos Reviewbot Windows


On Oct. 31, 2018, 7:21 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 31, 2018, 7:21 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/8/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/
-----------------------------------------------------------

(Updated Oct. 31, 2018, 6:21 p.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Added more comments.


Bugs: MESOS-9281
    https://issues.apache.org/jira/browse/MESOS-9281


Repository: mesos


Description
-------

Currently if a system crashes, SLRP checkpoints might not be synced to
the filesystem, so it is possible that an old or empty checkpoint will
be read upon recovery. Moreover, if a CSI call has been issued right
before the crash, the recovered state may be inconsistent with the
actual state reported by the plugin. For example, the plugin might have
created a volume but the checkpointed state does not know about it.

To avoid this inconsistency, we always call fsync()  when checkpointing
SLRP states.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
  src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 


Diff: https://reviews.apache.org/r/69010/diff/8/

Changes: https://reviews.apache.org/r/69010/diff/7-8/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/
-----------------------------------------------------------

(Updated Oct. 30, 2018, 9:08 p.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Enabled syncing when calling `mkdir` in `slave::state::checkpoint`.


Bugs: MESOS-9281
    https://issues.apache.org/jira/browse/MESOS-9281


Repository: mesos


Description
-------

Currently if a system crashes, SLRP checkpoints might not be synced to
the filesystem, so it is possible that an old or empty checkpoint will
be read upon recovery. Moreover, if a CSI call has been issued right
before the crash, the recovered state may be inconsistent with the
actual state reported by the plugin. For example, the plugin might have
created a volume but the checkpointed state does not know about it.

To avoid this inconsistency, we always call fsync()  when checkpointing
SLRP states.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
  src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 


Diff: https://reviews.apache.org/r/69010/diff/7/

Changes: https://reviews.apache.org/r/69010/diff/6-7/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209800
-----------------------------------------------------------



PASS: Mesos patch 69010 was successfully built and tested.

Reviews applied: `['69009', '69085', '69010']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2492/mesos-review-69010

- Mesos Reviewbot Windows


On Oct. 18, 2018, 2:01 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 18, 2018, 2:01 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/6/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209733
-----------------------------------------------------------



PASS: Mesos patch 69010 was successfully built and tested.

Reviews applied: `['69009', '69010']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2485/mesos-review-69010

- Mesos Reviewbot Windows


On Oct. 18, 2018, 2:01 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 18, 2018, 2:01 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/6/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/
-----------------------------------------------------------

(Updated Oct. 18, 2018, 2:01 a.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Synced CSI volume state checkpoints.


Bugs: MESOS-9281
    https://issues.apache.org/jira/browse/MESOS-9281


Repository: mesos


Description
-------

Currently if a system crashes, SLRP checkpoints might not be synced to
the filesystem, so it is possible that an old or empty checkpoint will
be read upon recovery. Moreover, if a CSI call has been issued right
before the crash, the recovered state may be inconsistent with the
actual state reported by the plugin. For example, the plugin might have
created a volume but the checkpointed state does not know about it.

To avoid this inconsistency, we always call fsync()  when checkpointing
SLRP states.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
  src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 


Diff: https://reviews.apache.org/r/69010/diff/6/

Changes: https://reviews.apache.org/r/69010/diff/5-6/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209715
-----------------------------------------------------------



Patch looks great!

Reviews applied: [69009, 69010]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Oct. 17, 2018, 10:30 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 17, 2018, 10:30 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/5/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/
-----------------------------------------------------------

(Updated Oct. 17, 2018, 8:30 p.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Addressed Benjamin's comment.


Bugs: MESOS-9281
    https://issues.apache.org/jira/browse/MESOS-9281


Repository: mesos


Description
-------

Currently if a system crashes, SLRP checkpoints might not be synced to
the filesystem, so it is possible that an old or empty checkpoint will
be read upon recovery. Moreover, if a CSI call has been issued right
before the crash, the recovered state may be inconsistent with the
actual state reported by the plugin. For example, the plugin might have
created a volume but the checkpointed state does not know about it.

To avoid this inconsistency, we always call fsync()  when checkpointing
SLRP states.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
  src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 


Diff: https://reviews.apache.org/r/69010/diff/5/

Changes: https://reviews.apache.org/r/69010/diff/4-5/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209597
-----------------------------------------------------------



PASS: Mesos patch 69010 was successfully built and tested.

Reviews applied: `['69009', '69010']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2471/mesos-review-69010

- Mesos Reviewbot Windows


On Oct. 16, 2018, 4:46 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 4:46 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/4/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/
-----------------------------------------------------------

(Updated Oct. 16, 2018, 4:46 a.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Left a TODO for enabling syncing by default.


Bugs: MESOS-9281
    https://issues.apache.org/jira/browse/MESOS-9281


Repository: mesos


Description
-------

Currently if a system crashes, SLRP checkpoints might not be synced to
the filesystem, so it is possible that an old or empty checkpoint will
be read upon recovery. Moreover, if a CSI call has been issued right
before the crash, the recovered state may be inconsistent with the
actual state reported by the plugin. For example, the plugin might have
created a volume but the checkpointed state does not know about it.

To avoid this inconsistency, we always call fsync()  when checkpointing
SLRP states.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
  src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 


Diff: https://reviews.apache.org/r/69010/diff/4/

Changes: https://reviews.apache.org/r/69010/diff/3-4/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/
-----------------------------------------------------------

(Updated Oct. 16, 2018, 2:48 a.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Addressed some of Benjamin's comments.


Bugs: MESOS-9281
    https://issues.apache.org/jira/browse/MESOS-9281


Repository: mesos


Description
-------

Currently if a system crashes, SLRP checkpoints might not be synced to
the filesystem, so it is possible that an old or empty checkpoint will
be read upon recovery. Moreover, if a CSI call has been issued right
before the crash, the recovered state may be inconsistent with the
actual state reported by the plugin. For example, the plugin might have
created a volume but the checkpointed state does not know about it.

To avoid this inconsistency, we always call fsync()  when checkpointing
SLRP states.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
  src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 


Diff: https://reviews.apache.org/r/69010/diff/3/

Changes: https://reviews.apache.org/r/69010/diff/2-3/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/
-----------------------------------------------------------

(Updated Oct. 15, 2018, 10:08 p.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Fixed a comment in another patch.


Bugs: MESOS-9281
    https://issues.apache.org/jira/browse/MESOS-9281


Repository: mesos


Description
-------

Currently if a system crashes, SLRP checkpoints might not be synced to
the filesystem, so it is possible that an old or empty checkpoint will
be read upon recovery. Moreover, if a CSI call has been issued right
before the crash, the recovered state may be inconsistent with the
actual state reported by the plugin. For example, the plugin might have
created a volume but the checkpointed state does not know about it.

To avoid this inconsistency, we always call fsync()  when checkpointing
SLRP states.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
  src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 


Diff: https://reviews.apache.org/r/69010/diff/2/

Changes: https://reviews.apache.org/r/69010/diff/1-2/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209508
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['69009', '69010']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2459/mesos-review-69010

Relevant logs:

- [mesos-tests-cmake.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2459/mesos-review-69010/logs/mesos-tests-cmake.log):

```
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\winport.c(256): warning C4090: 'function': different 'const' qualifiers [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\winport.c(166): warning C4716: 'pthread_cond_broadcast': must return a value [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\winport.c(205): warning C4716: 'pthread_cond_wait': must return a value [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\zookeeper.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(124): warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(128): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(279): warning C4267: 'function': conversion from 'size_t' to 'int', possible loss of data [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(301): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(368): warning C4267: 'function': conversion from 'size_t' to 'int', possible loss of data [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(372): warning C4267: 'function': conversion from 'size_t' to 'int', possible loss of data [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(512): warning C4267: 'function': conversion from 'size_t' to 'int', possible loss of data [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(543): warning C4996: 'strcpy': This function or variable may be unsafe. Consider using strcpy_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(548): warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]
         d:\dcos\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8\src\c\src\cli.c(569): warning C4996: 'strcpy': This function or variable may be unsafe. Consider using strcpy_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8\src\zookeeper-3.4.8-build\cli.vcxproj] [D:\DCOS\mesos\3rdparty\zookeeper-3.4.8.vcxproj]


       "D:\DCOS\mesos\src\tests\mesos-tests.vcxproj" (default target) (1) ->
       "D:\DCOS\mesos\src\mesos.vcxproj" (default target) (20) ->
       (ClCompile target) -> 
         d:\dcos\mesos\mesos\src\slave\state.hpp(228): error C2660: 'os::rename': function does not take 3 arguments (compiling source file D:\DCOS\mesos\mesos\src\slave\containerizer\mesos\provisioner\provisioner.cpp) [D:\DCOS\mesos\src\mesos.vcxproj]
         d:\dcos\mesos\mesos\src\slave\state.hpp(228): error C2512: 'Try<Nothing,Error>': no appropriate default constructor available (compiling source file D:\DCOS\mesos\mesos\src\slave\containerizer\mesos\provisioner\provisioner.cpp) [D:\DCOS\mesos\src\mesos.vcxproj]
         d:\dcos\mesos\mesos\src\slave\state.hpp(228): error C2660: 'os::rename': function does not take 3 arguments (compiling source file D:\DCOS\mesos\mesos\src\slave\containerizer\mesos\provisioner\docker\metadata_manager.cpp) [D:\DCOS\mesos\src\mesos.vcxproj]
         d:\dcos\mesos\mesos\src\slave\state.hpp(228): error C2512: 'Try<Nothing,Error>': no appropriate default constructor available (compiling source file D:\DCOS\mesos\mesos\src\slave\containerizer\mesos\provisioner\docker\metadata_manager.cpp) [D:\DCOS\mesos\src\mesos.vcxproj]
         d:\dcos\mesos\mesos\src\slave\state.hpp(228): error C2660: 'os::rename': function does not take 3 arguments (compiling source file D:\DCOS\mesos\mesos\src\slave\slave.cpp) [D:\DCOS\mesos\src\mesos.vcxproj]
         d:\dcos\mesos\mesos\src\slave\state.hpp(228): error C2512: 'Try<Nothing,Error>': no appropriate default constructor available (compiling source file D:\DCOS\mesos\mesos\src\slave\slave.cpp) [D:\DCOS\mesos\src\mesos.vcxproj]

    172 Warning(s)
    6 Error(s)

Time Elapsed 00:07:08.66
```

- Mesos Reviewbot Windows


On Oct. 12, 2018, 11:19 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2018, 11:19 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1158-1171 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096857#file2096857line1158>
> >
> >     Since the connection of this to the surrounding could is not immediately clear, could this be made part of `recoverResourceProviderState`?
> 
> Chun-Hung Hsiao wrote:
>     This is for constructing the list of operations for SUM (which is an external component wrt the SLRP) to report its checkpointing, so the SLRP could *reconcile* the the status updates. `recoverResourceProviderState` is for reading the SLRP's own state checkpoint and reconstruct its bookkeeping in memory so it seems not a good idea to mix the logic here and that function.

Did a bit of refactoring. Now the garbage collection logic is moved to the `garbageCollectOperationPath` function.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1175-1187 (original), 1191-1203 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096857#file2096857line1191>
> >
> >     Since at this point `uuid` is already not tracked anymore, I'd suggest to move this garbage collection into `checkpointResourceProviderState`.
> >     
> >     In that more general approach we should probably always check that `path` exists before turning an `Error` to remove it into a `Failure`.
> 
> Chun-Hung Hsiao wrote:
>     The path is removed only if the operation is terminal. But we call `checkpointResourceProviderState` after every change in totol resources, operations, etc. Since this logic is exercised at a specific scenario, it seems better to keep that function simple. 
>     
>     Also note that the operation path is created by the SUM, not by SLRP.

Refactored to `garbageCollectOperationPath`.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1797-1811 (original), 1814-1828 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096857#file2096857line1814>
> >
> >     It seems this could be part of `checkpointResourceProviderState`, see above.

Refactored to `garbaceCollectOperationPath`.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209542
-----------------------------------------------------------


On Oct. 15, 2018, 10:08 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2018, 10:08 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Benjamin Bannier <be...@mesosphere.io>.


> On Oct. 15, 2018, 4:21 p.m., Benjamin Bannier wrote:
> > src/slave/state.hpp
> > Line 192 (original), 196 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096858#file2096858line196>
> >
> >     I agree with James here. It seems totally fine to me to _always `sync`_ here. Could we do that? Alternatively we could introduce a dedicated function with weaker guarantees (e.g., `try_checkpoint`), but I don't see many good reasons for that, yet.
> 
> Chun-Hung Hsiao wrote:
>     Changing the behavior of checkpointing and backporting it without a thorough performance evaluation doesn't sound a good idea to me. Also note that we disabled `O_SYNC` for better performance before.
> 
> Chun-Hung Hsiao wrote:
>     Some numbers for fsync found on the Internet: https://gist.github.com/prashanthpai/e246be62656f25d7e31b
> 
> Chun-Hung Hsiao wrote:
>     For now I'll just leave a todo for enabling syncing by default. Does this sound okay to you?

Make sense. Since it is really suprising that `checkpoint` might not checkpoint, could you create a ticket to track. That could also be a place where we can discuss the trade-offs between correctness and performance.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209542
-----------------------------------------------------------


On Oct. 16, 2018, 6:46 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 6:46 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/4/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1158-1171 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096857#file2096857line1158>
> >
> >     Since the connection of this to the surrounding could is not immediately clear, could this be made part of `recoverResourceProviderState`?

This is for constructing the list of operations for SUM (which is an external component wrt the SLRP) to report its checkpointing, so the SLRP could *reconcile* the the status updates. `recoverResourceProviderState` is for reading the SLRP's own state checkpoint and reconstruct its bookkeeping in memory so it seems not a good idea to mix the logic here and that function.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1175-1187 (original), 1191-1203 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096857#file2096857line1191>
> >
> >     Since at this point `uuid` is already not tracked anymore, I'd suggest to move this garbage collection into `checkpointResourceProviderState`.
> >     
> >     In that more general approach we should probably always check that `path` exists before turning an `Error` to remove it into a `Failure`.

The path is removed only if the operation is terminal. But we call `checkpointResourceProviderState` after every change in totol resources, operations, etc. Since this logic is exercised at a specific scenario, it seems better to keep that function simple. 

Also note that the operation path is created by the SUM, not by SLRP.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/slave/state.hpp
> > Line 192 (original), 196 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096858#file2096858line196>
> >
> >     I agree with James here. It seems totally fine to me to _always `sync`_ here. Could we do that? Alternatively we could introduce a dedicated function with weaker guarantees (e.g., `try_checkpoint`), but I don't see many good reasons for that, yet.

Changing the behavior of checkpointing and backporting it without a thorough performance evaluation doesn't sound a good idea to me. Also note that we disabled `O_SYNC` for better performance before.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/slave/state.hpp
> > Line 214 (original), 218 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096858#file2096858line218>
> >
> >     Is this `sync` required here? It seems syncing after below `rename` would be totally fine.

Yes. We needs two `fsync`s to ensure the checkpoint is committed to the filesystem: one for the file (done through `write`, one for the directory (done through `rename`).


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209542
-----------------------------------------------------------


On Oct. 12, 2018, 11:19 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2018, 11:19 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/slave/state.hpp
> > Line 192 (original), 196 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096858#file2096858line196>
> >
> >     I agree with James here. It seems totally fine to me to _always `sync`_ here. Could we do that? Alternatively we could introduce a dedicated function with weaker guarantees (e.g., `try_checkpoint`), but I don't see many good reasons for that, yet.
> 
> Chun-Hung Hsiao wrote:
>     Changing the behavior of checkpointing and backporting it without a thorough performance evaluation doesn't sound a good idea to me. Also note that we disabled `O_SYNC` for better performance before.
> 
> Chun-Hung Hsiao wrote:
>     Some numbers for fsync found on the Internet: https://gist.github.com/prashanthpai/e246be62656f25d7e31b

For now I'll just leave a todo for enabling syncing by default. Does this sound okay to you?


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209542
-----------------------------------------------------------


On Oct. 16, 2018, 2:48 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 2:48 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Chun-Hung Hsiao <ch...@apache.org>.


> On Oct. 15, 2018, 2:21 p.m., Benjamin Bannier wrote:
> > src/slave/state.hpp
> > Line 192 (original), 196 (patched)
> > <https://reviews.apache.org/r/69010/diff/1/?file=2096858#file2096858line196>
> >
> >     I agree with James here. It seems totally fine to me to _always `sync`_ here. Could we do that? Alternatively we could introduce a dedicated function with weaker guarantees (e.g., `try_checkpoint`), but I don't see many good reasons for that, yet.
> 
> Chun-Hung Hsiao wrote:
>     Changing the behavior of checkpointing and backporting it without a thorough performance evaluation doesn't sound a good idea to me. Also note that we disabled `O_SYNC` for better performance before.

Some numbers for fsync found on the Internet: https://gist.github.com/prashanthpai/e246be62656f25d7e31b


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209542
-----------------------------------------------------------


On Oct. 16, 2018, 2:48 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 2:48 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Benjamin Bannier <be...@mesosphere.io>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209542
-----------------------------------------------------------




src/resource_provider/storage/provider.cpp
Line 663 (original), 663 (patched)
<https://reviews.apache.org/r/69010/#comment294055>

    This could be a separate, absolutely non-controversal patch.



src/resource_provider/storage/provider.cpp
Lines 1158-1171 (patched)
<https://reviews.apache.org/r/69010/#comment294054>

    Since the connection of this to the surrounding could is not immediately clear, could this be made part of `recoverResourceProviderState`?



src/resource_provider/storage/provider.cpp
Lines 1175-1187 (original), 1191-1203 (patched)
<https://reviews.apache.org/r/69010/#comment294056>

    Since at this point `uuid` is already not tracked anymore, I'd suggest to move this garbage collection into `checkpointResourceProviderState`.
    
    In that more general approach we should probably always check that `path` exists before turning an `Error` to remove it into a `Failure`.



src/resource_provider/storage/provider.cpp
Lines 1797-1811 (original), 1814-1828 (patched)
<https://reviews.apache.org/r/69010/#comment294057>

    It seems this could be part of `checkpointResourceProviderState`, see above.



src/slave/state.hpp
Lines 126 (patched)
<https://reviews.apache.org/r/69010/#comment294051>

    See below.



src/slave/state.hpp
Line 136 (original), 137 (patched)
<https://reviews.apache.org/r/69010/#comment294052>

    See below.



src/slave/state.hpp
Lines 173 (patched)
<https://reviews.apache.org/r/69010/#comment294053>

    See below.



src/slave/state.hpp
Line 192 (original), 196 (patched)
<https://reviews.apache.org/r/69010/#comment294045>

    I agree with James here. It seems totally fine to me to _always `sync`_ here. Could we do that? Alternatively we could introduce a dedicated function with weaker guarantees (e.g., `try_checkpoint`), but I don't see many good reasons for that, yet.



src/slave/state.hpp
Line 214 (original), 218 (patched)
<https://reviews.apache.org/r/69010/#comment294046>

    Is this `sync` required here? It seems syncing after below `rename` would be totally fine.


- Benjamin Bannier


On Oct. 13, 2018, 1:19 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 13, 2018, 1:19 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>

Re: Review Request 69010: Synced SLRP checkpoints to the filesystem.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69010/#review209565
-----------------------------------------------------------



Patch looks great!

Reviews applied: [69009, 69010]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Oct. 13, 2018, 1:19 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69010/
> -----------------------------------------------------------
> 
> (Updated Oct. 13, 2018, 1:19 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9281
>     https://issues.apache.org/jira/browse/MESOS-9281
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently if a system crashes, SLRP checkpoints might not be synced to
> the filesystem, so it is possible that an old or empty checkpoint will
> be read upon recovery. Moreover, if a CSI call has been issued right
> before the crash, the recovered state may be inconsistent with the
> actual state reported by the plugin. For example, the plugin might have
> created a volume but the checkpointed state does not know about it.
> 
> To avoid this inconsistency, we always call fsync()  when checkpointing
> SLRP states.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp db783b53558811081fb2671e005e8bbbd9edbede 
>   src/slave/state.hpp 003211e4670c1092acb1634220d76bafd39e3a20 
> 
> 
> Diff: https://reviews.apache.org/r/69010/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>