You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Chun-Hung Hsiao <ch...@apache.org> on 2019/02/05 07:42:29 UTC

Review Request 69893: Clean up persistent volumes on SLRP disks.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/
-----------------------------------------------------------

Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.


Bugs: MESOS-9544
    https://issues.apache.org/jira/browse/MESOS-9544


Repository: mesos


Description
-------

This patch limits SLRP to only support persistent volumes on MOUNT
disks, and makes it clean up data in persistent volumes when processing
`DESTROY` operations.


Diffs
-----

  src/resource_provider/storage/provider.cpp d6e20a549ede189c757ae3ae922ab7cb86d2be2c 
  src/resource_provider/storage/provider_process.hpp PRE-CREATION 


Diff: https://reviews.apache.org/r/69893/diff/1/


Testing
-------

`make check`

More testing done later in chain.


Thanks,

Chun-Hung Hsiao


Re: Review Request 69893: Clean up persistent volumes on SLRP disks.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On Feb. 12, 2019, 2 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3285-3291 (patched)
> > <https://reviews.apache.org/r/69893/diff/2/?file=2124904#file2124904line3287>
> >
> >     I am wondering whether it would make sense to try to remove data from as many volumes as possible instead of erroring out on the first failure. Not sure as users are left in a weird intermediate state anyway.
> 
> Chun-Hung Hsiao wrote:
>     Good point, especially since we don't crash the agent or SLRP. Let me do that.

After an offline discussion, the conclusion is that failures won't result in any data leakage since the PVs will still be there and the framework will have to retry, so stopping early is fine. The volumes will be in an intermediate state that some of the data might have been removed but some not, but it doesn't sound a valid use case to expect a PV to have data after a failed `DESTROY`.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/#review212738
-----------------------------------------------------------


On Feb. 12, 2019, 5:19 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69893/
> -----------------------------------------------------------
> 
> (Updated Feb. 12, 2019, 5:19 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.
> 
> 
> Bugs: MESOS-9544
>     https://issues.apache.org/jira/browse/MESOS-9544
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch limits SLRP to only support persistent volumes on MOUNT
> disks, and makes it clean up data in persistent volumes when processing
> `DESTROY` operations.
> 
> NOTE: Persistent volumes backed by CSI disks that are created before
> upgrading to a Mesos version that does not include this fix are subject
> to data leakage. To ensure data security, these persistent volume must
> be consumed by a task at least once after the upgrade before being
> destroyed.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp 09a710d668a5a7460b6c4e4fa32d3829dca7ac55 
>   src/resource_provider/storage/provider_process.hpp 36187fb4a25c49653530ee286fa9c1663177fbc6 
> 
> 
> Diff: https://reviews.apache.org/r/69893/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 69893: Clean up persistent volumes on SLRP disks.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On Feb. 12, 2019, 2 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3285-3291 (patched)
> > <https://reviews.apache.org/r/69893/diff/2/?file=2124904#file2124904line3287>
> >
> >     I am wondering whether it would make sense to try to remove data from as many volumes as possible instead of erroring out on the first failure. Not sure as users are left in a weird intermediate state anyway.

Good point, especially since we don't crash the agent or SLRP. Let me do that.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/#review212738
-----------------------------------------------------------


On Feb. 12, 2019, 5:19 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69893/
> -----------------------------------------------------------
> 
> (Updated Feb. 12, 2019, 5:19 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.
> 
> 
> Bugs: MESOS-9544
>     https://issues.apache.org/jira/browse/MESOS-9544
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch limits SLRP to only support persistent volumes on MOUNT
> disks, and makes it clean up data in persistent volumes when processing
> `DESTROY` operations.
> 
> NOTE: Persistent volumes backed by CSI disks that are created before
> upgrading to a Mesos version that does not include this fix are subject
> to data leakage. To ensure data security, these persistent volume must
> be consumed by a task at least once after the upgrade before being
> destroyed.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp 09a710d668a5a7460b6c4e4fa32d3829dca7ac55 
>   src/resource_provider/storage/provider_process.hpp 36187fb4a25c49653530ee286fa9c1663177fbc6 
> 
> 
> Diff: https://reviews.apache.org/r/69893/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 69893: Clean up persistent volumes on SLRP disks.

Posted by Benjamin Bannier <be...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/#review212738
-----------------------------------------------------------


Fix it, then Ship it!





src/resource_provider/storage/provider.cpp
Lines 3285-3291 (patched)
<https://reviews.apache.org/r/69893/#comment298628>

    I am wondering whether it would make sense to try to remove data from as many volumes as possible instead of erroring out on the first failure. Not sure as users are left in a weird intermediate state anyway.



src/resource_provider/storage/provider_process.hpp
Lines 317 (patched)
<https://reviews.apache.org/r/69893/#comment298625>

    This could be `const.`



src/resource_provider/storage/provider_process.hpp
Lines 320 (patched)
<https://reviews.apache.org/r/69893/#comment298626>

    This could be `const.`


- Benjamin Bannier


On Feb. 12, 2019, 6:19 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69893/
> -----------------------------------------------------------
> 
> (Updated Feb. 12, 2019, 6:19 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.
> 
> 
> Bugs: MESOS-9544
>     https://issues.apache.org/jira/browse/MESOS-9544
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch limits SLRP to only support persistent volumes on MOUNT
> disks, and makes it clean up data in persistent volumes when processing
> `DESTROY` operations.
> 
> NOTE: Persistent volumes backed by CSI disks that are created before
> upgrading to a Mesos version that does not include this fix are subject
> to data leakage. To ensure data security, these persistent volume must
> be consumed by a task at least once after the upgrade before being
> destroyed.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp 09a710d668a5a7460b6c4e4fa32d3829dca7ac55 
>   src/resource_provider/storage/provider_process.hpp 36187fb4a25c49653530ee286fa9c1663177fbc6 
> 
> 
> Diff: https://reviews.apache.org/r/69893/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 69893: Cleaned up persistent volumes on SLRP disks.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/
-----------------------------------------------------------

(Updated Feb. 13, 2019, 3:11 a.m.)


Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.


Changes
-------

Addressed Benjamin's comments.


Summary (updated)
-----------------

Cleaned up persistent volumes on SLRP disks.


Bugs: MESOS-9544
    https://issues.apache.org/jira/browse/MESOS-9544


Repository: mesos


Description
-------

This patch limits SLRP to only support persistent volumes on MOUNT
disks, and makes it clean up data in persistent volumes when processing
`DESTROY` operations.

NOTE: Persistent volumes backed by CSI disks that are created before
upgrading to a Mesos version that does not include this fix are subject
to data leakage. To ensure data security, these persistent volume must
be consumed by a task at least once after the upgrade before being
destroyed.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp 09a710d668a5a7460b6c4e4fa32d3829dca7ac55 
  src/resource_provider/storage/provider_process.hpp 36187fb4a25c49653530ee286fa9c1663177fbc6 


Diff: https://reviews.apache.org/r/69893/diff/3/

Changes: https://reviews.apache.org/r/69893/diff/2-3/


Testing
-------

`make check`

More testing done later in chain.


Thanks,

Chun-Hung Hsiao


Re: Review Request 69893: Clean up persistent volumes on SLRP disks.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/
-----------------------------------------------------------

(Updated Feb. 12, 2019, 5:19 a.m.)


Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.


Changes
-------

Addressed Benjamin's comment.


Bugs: MESOS-9544
    https://issues.apache.org/jira/browse/MESOS-9544


Repository: mesos


Description (updated)
-------

This patch limits SLRP to only support persistent volumes on MOUNT
disks, and makes it clean up data in persistent volumes when processing
`DESTROY` operations.

NOTE: Persistent volumes backed by CSI disks that are created before
upgrading to a Mesos version that does not include this fix are subject
to data leakage. To ensure data security, these persistent volume must
be consumed by a task at least once after the upgrade before being
destroyed.


Diffs (updated)
-----

  src/resource_provider/storage/provider.cpp 09a710d668a5a7460b6c4e4fa32d3829dca7ac55 
  src/resource_provider/storage/provider_process.hpp 36187fb4a25c49653530ee286fa9c1663177fbc6 


Diff: https://reviews.apache.org/r/69893/diff/2/

Changes: https://reviews.apache.org/r/69893/diff/1-2/


Testing
-------

`make check`

More testing done later in chain.


Thanks,

Chun-Hung Hsiao


Re: Review Request 69893: Clean up persistent volumes on SLRP disks.

Posted by Benjamin Bannier <be...@mesosphere.io>.

> On Feb. 5, 2019, 6:55 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3244 (patched)
> > <https://reviews.apache.org/r/69893/diff/1/?file=2123916#file2123916line3246>
> >
> >     Does this break seemless upgrades? Probably okay, but still something worth documenting in the upgrades guide.
> 
> Chun-Hung Hsiao wrote:
>     This depends on what we'll do for https://reviews.apache.org/r/69892/#comment298376. In the current proposal where there are separated boot IDs, this won't introduce any upgrade issue since it will be empty anyway. During SLRP recovery, it would treat an existing published volume as `VOL_READY` but not `PUBLISHED` and will call `NodePublishVolume` again. But this is okay since `NodePublishVolume` is idempotent.
> 
> Chun-Hung Hsiao wrote:
>     This is a bug fix and will be backported to 1.7.x, so I'm not sure how to properly document this in `upgrades.md`. I've added some documentation to the commit message, and will also add that to the MESOS ticket and the CHANGELOG when committing the fix. Would this be enough?

Sg!


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/#review212558
-----------------------------------------------------------


On Feb. 12, 2019, 6:19 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69893/
> -----------------------------------------------------------
> 
> (Updated Feb. 12, 2019, 6:19 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.
> 
> 
> Bugs: MESOS-9544
>     https://issues.apache.org/jira/browse/MESOS-9544
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch limits SLRP to only support persistent volumes on MOUNT
> disks, and makes it clean up data in persistent volumes when processing
> `DESTROY` operations.
> 
> NOTE: Persistent volumes backed by CSI disks that are created before
> upgrading to a Mesos version that does not include this fix are subject
> to data leakage. To ensure data security, these persistent volume must
> be consumed by a task at least once after the upgrade before being
> destroyed.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp 09a710d668a5a7460b6c4e4fa32d3829dca7ac55 
>   src/resource_provider/storage/provider_process.hpp 36187fb4a25c49653530ee286fa9c1663177fbc6 
> 
> 
> Diff: https://reviews.apache.org/r/69893/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 69893: Clean up persistent volumes on SLRP disks.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On Feb. 5, 2019, 5:55 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3137 (patched)
> > <https://reviews.apache.org/r/69893/diff/1/?file=2123916#file2123916line3139>
> >
> >     This is some pretty strong coupling to `master::validate` checking `resource::validatePersistentVolume`. That's probably okay, but would could also just re-validate here instead.

This is similar to what we do between the master and the agent: the master validates and the agent checks.


> On Feb. 5, 2019, 5:55 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3125-3126 (original), 3138-3139 (patched)
> > <https://reviews.apache.org/r/69893/diff/1/?file=2123916#file2123916line3140>
> >
> >     Do we actually validate this somewhere?

https://github.com/apache/mesos/blob/ae7e7330d7fc7bbe2ce923025025577e087f8348/src/master/validation.cpp#L2570-L2573


> On Feb. 5, 2019, 5:55 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3230 (patched)
> > <https://reviews.apache.org/r/69893/diff/1/?file=2123916#file2123916line3232>
> >
> >     Is this validated somewhere?

https://github.com/apache/mesos/blob/ae7e7330d7fc7bbe2ce923025025577e087f8348/src/master/validation.cpp#L2570-L2573 combined with L3208--L3213 in this patch.


> On Feb. 5, 2019, 5:55 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3234 (patched)
> > <https://reviews.apache.org/r/69893/diff/1/?file=2123916#file2123916line3236>
> >
> >     Is this validated?

All MOUNT disk resources managed by SLRP will have an entry in `volumes`, and the master only allows an operation to be applied if its consumed resources are contained in the resource pool, hence this is validated implicitly by the master.


> On Feb. 5, 2019, 5:55 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3244 (patched)
> > <https://reviews.apache.org/r/69893/diff/1/?file=2123916#file2123916line3246>
> >
> >     Does this break seemless upgrades? Probably okay, but still something worth documenting in the upgrades guide.

This depends on what we'll do for https://reviews.apache.org/r/69892/#comment298376. In the current proposal where there are separated boot IDs, this won't introduce any upgrade issue since it will be empty anyway. During SLRP recovery, it would treat an existing published volume as `VOL_READY` but not `PUBLISHED` and will call `NodePublishVolume` again. But this is okay since `NodePublishVolume` is idempotent.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/#review212558
-----------------------------------------------------------


On Feb. 5, 2019, 7:42 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69893/
> -----------------------------------------------------------
> 
> (Updated Feb. 5, 2019, 7:42 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.
> 
> 
> Bugs: MESOS-9544
>     https://issues.apache.org/jira/browse/MESOS-9544
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch limits SLRP to only support persistent volumes on MOUNT
> disks, and makes it clean up data in persistent volumes when processing
> `DESTROY` operations.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp d6e20a549ede189c757ae3ae922ab7cb86d2be2c 
>   src/resource_provider/storage/provider_process.hpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/69893/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 69893: Clean up persistent volumes on SLRP disks.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On Feb. 5, 2019, 5:55 p.m., Benjamin Bannier wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 3244 (patched)
> > <https://reviews.apache.org/r/69893/diff/1/?file=2123916#file2123916line3246>
> >
> >     Does this break seemless upgrades? Probably okay, but still something worth documenting in the upgrades guide.
> 
> Chun-Hung Hsiao wrote:
>     This depends on what we'll do for https://reviews.apache.org/r/69892/#comment298376. In the current proposal where there are separated boot IDs, this won't introduce any upgrade issue since it will be empty anyway. During SLRP recovery, it would treat an existing published volume as `VOL_READY` but not `PUBLISHED` and will call `NodePublishVolume` again. But this is okay since `NodePublishVolume` is idempotent.

This is a bug fix and will be backported to 1.7.x, so I'm not sure how to properly document this in `upgrades.md`. I've added some documentation to the commit message, and will also add that to the MESOS ticket and the CHANGELOG when committing the fix. Would this be enough?


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/#review212558
-----------------------------------------------------------


On Feb. 5, 2019, 7:42 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69893/
> -----------------------------------------------------------
> 
> (Updated Feb. 5, 2019, 7:42 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.
> 
> 
> Bugs: MESOS-9544
>     https://issues.apache.org/jira/browse/MESOS-9544
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch limits SLRP to only support persistent volumes on MOUNT
> disks, and makes it clean up data in persistent volumes when processing
> `DESTROY` operations.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp d6e20a549ede189c757ae3ae922ab7cb86d2be2c 
>   src/resource_provider/storage/provider_process.hpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/69893/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 69893: Clean up persistent volumes on SLRP disks.

Posted by Benjamin Bannier <be...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69893/#review212558
-----------------------------------------------------------




src/resource_provider/storage/provider.cpp
Lines 3137 (patched)
<https://reviews.apache.org/r/69893/#comment298378>

    This is some pretty strong coupling to `master::validate` checking `resource::validatePersistentVolume`. That's probably okay, but would could also just re-validate here instead.



src/resource_provider/storage/provider.cpp
Lines 3125-3126 (original), 3138-3139 (patched)
<https://reviews.apache.org/r/69893/#comment298379>

    Do we actually validate this somewhere?



src/resource_provider/storage/provider.cpp
Lines 3230 (patched)
<https://reviews.apache.org/r/69893/#comment298380>

    Is this validated somewhere?



src/resource_provider/storage/provider.cpp
Lines 3234 (patched)
<https://reviews.apache.org/r/69893/#comment298381>

    Is this validated?



src/resource_provider/storage/provider.cpp
Lines 3244 (patched)
<https://reviews.apache.org/r/69893/#comment298382>

    Does this break seemless upgrades? Probably okay, but still something worth documenting in the upgrades guide.


- Benjamin Bannier


On Feb. 5, 2019, 8:42 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69893/
> -----------------------------------------------------------
> 
> (Updated Feb. 5, 2019, 8:42 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, James DeFelice, and Jie Yu.
> 
> 
> Bugs: MESOS-9544
>     https://issues.apache.org/jira/browse/MESOS-9544
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch limits SLRP to only support persistent volumes on MOUNT
> disks, and makes it clean up data in persistent volumes when processing
> `DESTROY` operations.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/storage/provider.cpp d6e20a549ede189c757ae3ae922ab7cb86d2be2c 
>   src/resource_provider/storage/provider_process.hpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/69893/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> More testing done later in chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>