You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Jie Yu <yu...@gmail.com> on 2014/12/08 21:43:37 UTC

Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/
-----------------------------------------------------------

Review request for mesos and Ben Mahler.


Bugs: MESOS-2031
    https://issues.apache.org/jira/browse/MESOS-2031


Repository: mesos-git


Description
-------

Started to maintain and checkpoint persisted resource in slave. That includes:
1) responds to update resources message
2) checkpoint resources
3) recover checkpointed resources
4) send checkpointed resources during register/reregister


Diffs
-----

  src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
  src/slave/slave.cpp 9ac64589c353b2f17f538db7de01faa55b2369b9 

Diff: https://reviews.apache.org/r/28809/diff/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review64318
-----------------------------------------------------------


Bad patch!

Reviews applied: [28711, 28720]

Failed command: ./support/apply-review.sh -n -r 28720

Error:
 2014-12-09 00:04:12 URL:https://reviews.apache.org/r/28720/diff/raw/ [9603/9603] -> "28720.patch" [1]
error: patch failed: src/master/master.cpp:1796
error: src/master/master.cpp: patch does not apply
error: patch failed: src/tests/resource_offers_tests.cpp:256
error: src/tests/resource_offers_tests.cpp: patch does not apply
Failed to apply patch

- Mesos ReviewBot


On Dec. 8, 2014, 10:19 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2014, 10:19 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp 9ac64589c353b2f17f538db7de01faa55b2369b9 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/
-----------------------------------------------------------

(Updated Feb. 9, 2015, 7:33 p.m.)


Review request for mesos and Ben Mahler.


Changes
-------

Vinod's comments.


Bugs: MESOS-2031
    https://issues.apache.org/jira/browse/MESOS-2031


Repository: mesos


Description
-------

Started to maintain and checkpoint persisted resource in slave. That includes:
1) responds to update resources message
2) checkpoint resources
3) recover checkpointed resources
4) send checkpointed resources during register/reregister


Diffs (updated)
-----

  src/slave/slave.hpp 9adee17cb94a72f0e1e139b3fd8978a9a1ff6237 
  src/slave/slave.cpp fff2d725fe49eee984d9151cfb2131202c47994f 
  src/slave/state.hpp 04084fc5c9130bf7d824b2ced4eb7053a995edd0 
  src/tests/persistent_volume_tests.cpp ffbaedddbd21d0eb7964be0bff4368384389b0a0 

Diff: https://reviews.apache.org/r/28809/diff/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review71666
-----------------------------------------------------------

Ship it!



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment117529>

    s/delivery/delivery per connection/



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment117532>

    s/we create ephemeral connections/master creates multiple connections to the slave in some cases (e.g., persistent socket to slave breaks and master uses ephemeral socket)/



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment117533>

    s/we are expecting the following cases/there are two cases to consider/



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment117534>

    s/reconcile the state/reconcile the state with the slave/



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment117535>

    Also mention that, since master is the source of truth for reservations the inconsistency is not exposed to frameworks?



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment117536>

    s/If/When/



src/tests/persistent_volume_tests.cpp
<https://reviews.apache.org/r/28809/#comment117538>

    s/registration/re-registration/


- Vinod Kone


On Feb. 9, 2015, 5:59 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Feb. 9, 2015, 5:59 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 9adee17cb94a72f0e1e139b3fd8978a9a1ff6237 
>   src/slave/slave.cpp fff2d725fe49eee984d9151cfb2131202c47994f 
>   src/slave/state.hpp 04084fc5c9130bf7d824b2ced4eb7053a995edd0 
>   src/tests/persistent_volume_tests.cpp ffbaedddbd21d0eb7964be0bff4368384389b0a0 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/
-----------------------------------------------------------

(Updated Feb. 9, 2015, 5:59 p.m.)


Review request for mesos and Ben Mahler.


Changes
-------

Rebased.


Bugs: MESOS-2031
    https://issues.apache.org/jira/browse/MESOS-2031


Repository: mesos


Description
-------

Started to maintain and checkpoint persisted resource in slave. That includes:
1) responds to update resources message
2) checkpoint resources
3) recover checkpointed resources
4) send checkpointed resources during register/reregister


Diffs (updated)
-----

  src/slave/slave.hpp 9adee17cb94a72f0e1e139b3fd8978a9a1ff6237 
  src/slave/slave.cpp fff2d725fe49eee984d9151cfb2131202c47994f 
  src/slave/state.hpp 04084fc5c9130bf7d824b2ced4eb7053a995edd0 
  src/tests/persistent_volume_tests.cpp ffbaedddbd21d0eb7964be0bff4368384389b0a0 

Diff: https://reviews.apache.org/r/28809/diff/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/
-----------------------------------------------------------

(Updated Feb. 6, 2015, 12:05 a.m.)


Review request for mesos and Ben Mahler.


Changes
-------

Addressed review comments and added tests.


Bugs: MESOS-2031
    https://issues.apache.org/jira/browse/MESOS-2031


Repository: mesos


Description
-------

Started to maintain and checkpoint persisted resource in slave. That includes:
1) responds to update resources message
2) checkpoint resources
3) recover checkpointed resources
4) send checkpointed resources during register/reregister


Diffs (updated)
-----

  src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
  src/slave/slave.cpp a8b262174ab5c9a524db8318d3d1438cd75a702b 
  src/slave/state.hpp f92808ad9b1623cea0c35ec735c53a3d6457bdbe 
  src/tests/persistent_volume_tests.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/28809/diff/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review71111
-----------------------------------------------------------


tests?


src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment116715>

    s/and/or/



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment116713>

    the fact that a failed slave will restart is dependent on the cluster operator right? how about 
    
    "..we simply fail the slave to be safe."



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment116718>

    s/from/for/



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment116721>

    s/from/for/



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment116709>

    Can you also expand on the implications of out-of-order messages? Does the master reconcile them? I guess if master fails over we cannot reconcile. Do we expect Frameworks to reconcile in this case based on the offers they get?


- Vinod Kone


On Feb. 4, 2015, 7:13 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Feb. 4, 2015, 7:13 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp a8b262174ab5c9a524db8318d3d1438cd75a702b 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.

> On Feb. 5, 2015, 7:38 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 1335-1353
> > <https://reviews.apache.org/r/28809/diff/4/?file=848405#file848405line1335>
> >
> >     Is it safe to do this once on the combined resources?
> >     
> >     ```
> >     Resources resources = task.resources();
> >     
> >     if (task.has_executor()) {
> >       resources += task.executor().resources();
> >     }
> >     
> >     foreach (const Resource& resource, resources) {
> >       ...
> >     }
> >     ```

It's safe, but I want to print different ERROR messages for task and executor.


> On Feb. 5, 2015, 7:38 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 3478-3479
> > <https://reviews.apache.org/r/28809/diff/4/?file=848405#file848405line3478>
> >
> >     It's too bad the `errors` are just unsigned ints and not a collection of `Option<Error>`s that we can print here.

Added a TODO.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review71279
-----------------------------------------------------------


On Feb. 4, 2015, 7:13 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Feb. 4, 2015, 7:13 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp a8b262174ab5c9a524db8318d3d1438cd75a702b 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review71279
-----------------------------------------------------------


Looks good! Will take another look when Vinod's comments are addressed.


src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment116951>

    Is it safe to do this once on the combined resources?
    
    ```
    Resources resources = task.resources();
    
    if (task.has_executor()) {
      resources += task.executor().resources();
    }
    
    foreach (const Resource& resource, resources) {
      ...
    }
    ```



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment116953>

    One line?
    
    ```
      << "Failed to checkpoint resources " << newCheckpointedResources;
    ```



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment116955>

    It's too bad the `errors` are just unsigned ints and not a collection of `Option<Error>`s that we can print here.


- Ben Mahler


On Feb. 4, 2015, 7:13 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Feb. 4, 2015, 7:13 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp a8b262174ab5c9a524db8318d3d1438cd75a702b 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/
-----------------------------------------------------------

(Updated Feb. 4, 2015, 7:13 p.m.)


Review request for mesos and Ben Mahler.


Changes
-------

Rebased.


Bugs: MESOS-2031
    https://issues.apache.org/jira/browse/MESOS-2031


Repository: mesos


Description
-------

Started to maintain and checkpoint persisted resource in slave. That includes:
1) responds to update resources message
2) checkpoint resources
3) recover checkpointed resources
4) send checkpointed resources during register/reregister


Diffs (updated)
-----

  src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
  src/slave/slave.cpp a8b262174ab5c9a524db8318d3d1438cd75a702b 

Diff: https://reviews.apache.org/r/28809/diff/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/
-----------------------------------------------------------

(Updated Jan. 29, 2015, 7:04 p.m.)


Review request for mesos and Ben Mahler.


Bugs: MESOS-2031
    https://issues.apache.org/jira/browse/MESOS-2031


Repository: mesos-git


Description
-------

Started to maintain and checkpoint persisted resource in slave. That includes:
1) responds to update resources message
2) checkpoint resources
3) recover checkpointed resources
4) send checkpointed resources during register/reregister


Diffs
-----

  src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
  src/slave/slave.cpp a8b262174ab5c9a524db8318d3d1438cd75a702b 

Diff: https://reviews.apache.org/r/28809/diff/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/
-----------------------------------------------------------

(Updated Jan. 29, 2015, 7:02 p.m.)


Review request for mesos and Ben Mahler.


Changes
-------

Review comments. Rebased.


Bugs: MESOS-2031
    https://issues.apache.org/jira/browse/MESOS-2031


Repository: mesos-git


Description
-------

Started to maintain and checkpoint persisted resource in slave. That includes:
1) responds to update resources message
2) checkpoint resources
3) recover checkpointed resources
4) send checkpointed resources during register/reregister


Diffs (updated)
-----

  src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
  src/slave/slave.cpp a8b262174ab5c9a524db8318d3d1438cd75a702b 

Diff: https://reviews.apache.org/r/28809/diff/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.

> On Jan. 8, 2015, 12:28 a.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 1794-1815
> > <https://reviews.apache.org/r/28809/diff/2/?file=785913#file785913line1794>
> >
> >     What about adding a method in slave/state.hpp for checkpointing Resources?
> >     
> >     ```
> >     Try<Nothing> checkpoint(
> >         const std::string& path,
> >         const Resources& resources);
> >     ```

This is done in https://reviews.apache.org/r/29918/


> On Jan. 8, 2015, 12:28 a.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 1817-1826
> > <https://reviews.apache.org/r/28809/diff/2/?file=785913#file785913line1817>
> >
> >     Maybe a little NOTE here that we assume that messages are ordered for the releasing to be correct? And that ordering is technically not guaranteed, maybe pointing to the relevant tickets?
> >     
> >     Just want to make sure we know what needs to be done to prevent this from ever biting us.

Added a TODO here explaining the details about the ordering issue.


> On Jan. 8, 2015, 12:28 a.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1830
> > <https://reviews.apache.org/r/28809/diff/2/?file=785913#file785913line1830>
> >
> >     "Updated persistent resources to Y"?
> >     
> >     Or
> >     
> >     "Updated persistent resources from X to Y"?

Fixed.


> On Jan. 8, 2015, 12:28 a.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 3700
> > <https://reviews.apache.org/r/28809/diff/2/?file=785913#file785913line3700>
> >
> >     Shouldn't this CHECK exist in updateResources instead of here? Or in both places?
> >     
> >     Otherwise the slave will create a situation where it will fail CHECKs when it next recovers, is there something else I'm missing?

Doesn't apply any more. We want to get dynamic reservation done as well.


> On Jan. 8, 2015, 12:28 a.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 1789-1791
> > <https://reviews.apache.org/r/28809/diff/2/?file=785913#file785913line1789>
> >
> >     Just so I understand, we won't be deleting anything, right? We'll leave volumes "dangling" only when the master fails over during the issue you described here. In that case, the master also thinks that there are no persisted resources on the slave.
> >     
> >     And we won't delete the unknown volumes, we'll leave them dangling on the filesystem, right?. Can you please file a ticket to capture this issue and link it here and in the epic? Don't want to leave it unfixed.
> >     
> >     If the master is up, it should re-send the persistent resources at which point the slave gets them back during re-registration, right?. During registration however, the problem still exists, right?
> >     
> >     Let's document the details here and/or in a ticket!

This TODO has been resolved by using an all-or-nothing atomic checkpoint function.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review67131
-----------------------------------------------------------


On Dec. 8, 2014, 10:19 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2014, 10:19 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp 9ac64589c353b2f17f538db7de01faa55b2369b9 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review67131
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment110970>

    Just so I understand, we won't be deleting anything, right? We'll leave volumes "dangling" only when the master fails over during the issue you described here. In that case, the master also thinks that there are no persisted resources on the slave.
    
    And we won't delete the unknown volumes, we'll leave them dangling on the filesystem, right?. Can you please file a ticket to capture this issue and link it here and in the epic? Don't want to leave it unfixed.
    
    If the master is up, it should re-send the persistent resources at which point the slave gets them back during re-registration, right?. During registration however, the problem still exists, right?
    
    Let's document the details here and/or in a ticket!



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment110972>

    What about adding a method in slave/state.hpp for checkpointing Resources?
    
    ```
    Try<Nothing> checkpoint(
        const std::string& path,
        const Resources& resources);
    ```



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment110973>

    Maybe a little NOTE here that we assume that messages are ordered for the releasing to be correct? And that ordering is technically not guaranteed, maybe pointing to the relevant tickets?
    
    Just want to make sure we know what needs to be done to prevent this from ever biting us.



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment110974>

    "Updated persistent resources to Y"?
    
    Or
    
    "Updated persistent resources from X to Y"?



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment110977>

    Shouldn't this CHECK exist in updateResources instead of here? Or in both places?
    
    Otherwise the slave will create a situation where it will fail CHECKs when it next recovers, is there something else I'm missing?


- Ben Mahler


On Dec. 8, 2014, 10:19 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2014, 10:19 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp 9ac64589c353b2f17f538db7de01faa55b2369b9 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.

> On Dec. 22, 2014, 5:43 a.m., Adam B wrote:
> > src/slave/slave.cpp, line 1787
> > <https://reviews.apache.org/r/28809/diff/2/?file=785913#file785913line1787>
> >
> >     What is the source of truth for persistent resources? The checkpoint or the master? What if a framework is trying to launch a task on a slave with a new persistent disk resource, while at the same time, the slave is restarted with new persistent disks added by the slave operator? Assume we can update slave resources without invalidating the SlaveID, and that the master has already started processing the launchTask when the slave tries to reregister.
> >     I'm guessing the slave would reject the UpdateResources call until the slave has successfully re-registered with the master, so the master would have the updated persistentResource, which it would then update with the newly launched task's persistent disk.

Good question! The source of truth is in master in the current design. There are several resaons for this choice. But I guess the main motivation is that master is in the central place of the cluster (in the middle of framework and slaves). Keeping source of truch in master makes it easy for us to maintain a consistent state for the cluster.

Since master can fail over, slave needs to checkpoint persistent volumes and dynamic reservations and report them back to master when master fails over. Also, a newly registered slave needs to tell master about it's persistent volumes and dynamic reservations as well since master has no idea about a new slave.

Now, If an operator wants to add a new persistent volume, he/she has to start the slave as a new slave (new SlaveID). This is the similar to changing SlaveInfo right now. Once we have a way to update SlaveInfo (MESOS-1739), we could use the similar mechanism to update persistent volumes and dynamic reservations.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review65755
-----------------------------------------------------------


On Dec. 8, 2014, 10:19 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2014, 10:19 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp 9ac64589c353b2f17f538db7de01faa55b2369b9 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Adam B <ad...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review65755
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/28809/#comment108998>

    What is the source of truth for persistent resources? The checkpoint or the master? What if a framework is trying to launch a task on a slave with a new persistent disk resource, while at the same time, the slave is restarted with new persistent disks added by the slave operator? Assume we can update slave resources without invalidating the SlaveID, and that the master has already started processing the launchTask when the slave tries to reregister.
    I'm guessing the slave would reject the UpdateResources call until the slave has successfully re-registered with the master, so the master would have the updated persistentResource, which it would then update with the newly launched task's persistent disk.


- Adam B


On Dec. 8, 2014, 2:19 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2014, 2:19 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp 9ac64589c353b2f17f538db7de01faa55b2369b9 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/
-----------------------------------------------------------

(Updated Dec. 8, 2014, 10:19 p.m.)


Review request for mesos and Ben Mahler.


Bugs: MESOS-2031
    https://issues.apache.org/jira/browse/MESOS-2031


Repository: mesos-git


Description
-------

Started to maintain and checkpoint persisted resource in slave. That includes:
1) responds to update resources message
2) checkpoint resources
3) recover checkpointed resources
4) send checkpointed resources during register/reregister


Diffs (updated)
-----

  src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
  src/slave/slave.cpp 9ac64589c353b2f17f538db7de01faa55b2369b9 

Diff: https://reviews.apache.org/r/28809/diff/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 28809: Started to maintain and checkpoint persisted resource in slave.

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28809/#review64281
-----------------------------------------------------------


Bad patch!

Reviews applied: [28711, 28720]

Failed command: ./support/apply-review.sh -n -r 28720

Error:
 2014-12-08 21:09:03 URL:https://reviews.apache.org/r/28720/diff/raw/ [9603/9603] -> "28720.patch" [1]
error: patch failed: src/master/master.cpp:1796
error: src/master/master.cpp: patch does not apply
error: patch failed: src/tests/resource_offers_tests.cpp:256
error: src/tests/resource_offers_tests.cpp: patch does not apply
Failed to apply patch

- Mesos ReviewBot


On Dec. 8, 2014, 8:43 p.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28809/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2014, 8:43 p.m.)
> 
> 
> Review request for mesos and Ben Mahler.
> 
> 
> Bugs: MESOS-2031
>     https://issues.apache.org/jira/browse/MESOS-2031
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Started to maintain and checkpoint persisted resource in slave. That includes:
> 1) responds to update resources message
> 2) checkpoint resources
> 3) recover checkpointed resources
> 4) send checkpointed resources during register/reregister
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 70bd8c1fde4ea09fa54c76aa93424a1adb0309f6 
>   src/slave/slave.cpp 9ac64589c353b2f17f538db7de01faa55b2369b9 
> 
> Diff: https://reviews.apache.org/r/28809/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>