You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Chun-Hung Hsiao <ch...@apache.org> on 2019/03/01 05:52:59 UTC

Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/
-----------------------------------------------------------

Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Bugs: MESOS-9607
    https://issues.apache.org/jira/browse/MESOS-9607


Repository: mesos


Description
-------

`Slave::publishResources` will no longer ask all resource providers to
publish all allocated resources. Instead, it only asks those of the
task's resources to publish resources, so a failed resource provider
would only fail tasks that want to use its resources.


Diffs
-----

  src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
  src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/70081/diff/1/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On March 1, 2019, 10:15 p.m., Benjamin Bannier wrote:
> > src/slave/slave.cpp
> > Line 8777 (original), 8785 (patched)
> > <https://reviews.apache.org/r/70081/diff/2/?file=2128176#file2128176line8801>
> >
> >     We potentially publish to multiple resource providers here where each could fail independently. Do we need to perform any cleanup if we were able to publish to all but one RP? By using `collect` (which is dictated by the return type) we seem to make it hard for the caller to perform such cleanup, but am unsure we can perform the cleanup here (it might also fail).
> >     
> >     I am missing something which makes this a no-issue or would we need to implement above `TODO` to make this work?

Because resource allocation lifecycles are tied to task lifecycles: as soon as a task is completed, the master would immediately offer out any resource provider resource previously used by the task. This means that it would be very complicated to do cleanups (I assume you're talking about things like `NodeUnpublishVolume`) asynchronously as they might race with another `PUBLISH_RESOURCES`. Therefore, all cleanups are done *lazily* only when necessary. This is also related to why we cannot use diff-based resource publishing.

Moreover, we currently don't do any `NodeUnpublishVolume` until the disk is destroyed. A subsequent task would just bind-mount the already-published volume into their sandbox.

So the bottom line is we intentionally make `PUBLISH_RESOURES` an operation that requires no cleanup.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213337
-----------------------------------------------------------


On March 1, 2019, 8:11 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 1, 2019, 8:11 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks those of the
> task's resources to publish resources, so a failed resource provider
> would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Benjamin Bannier <be...@mesosphere.io>.

> On March 1, 2019, 11:15 p.m., Benjamin Bannier wrote:
> > src/slave/slave.cpp
> > Line 8777 (original), 8785 (patched)
> > <https://reviews.apache.org/r/70081/diff/2/?file=2128176#file2128176line8801>
> >
> >     We potentially publish to multiple resource providers here where each could fail independently. Do we need to perform any cleanup if we were able to publish to all but one RP? By using `collect` (which is dictated by the return type) we seem to make it hard for the caller to perform such cleanup, but am unsure we can perform the cleanup here (it might also fail).
> >     
> >     I am missing something which makes this a no-issue or would we need to implement above `TODO` to make this work?
> 
> Chun-Hung Hsiao wrote:
>     Because resource allocation lifecycles are tied to task lifecycles: as soon as a task is completed, the master would immediately offer out any resource provider resource previously used by the task. This means that it would be very complicated to do cleanups (I assume you're talking about things like `NodeUnpublishVolume`) asynchronously as they might race with another `PUBLISH_RESOURCES`. Therefore, all cleanups are done *lazily* only when necessary. This is also related to why we cannot use diff-based resource publishing.
>     
>     Moreover, we currently don't do any `NodeUnpublishVolume` until the disk is destroyed. A subsequent task would just bind-mount the already-published volume into their sandbox.
>     
>     So the bottom line is we intentionally make `PUBLISH_RESOURES` an operation that requires no cleanup.

Awesome, mind adding a comment on no cleanup being necessary here?

Dropping the issue.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213337
-----------------------------------------------------------


On March 2, 2019, 12:46 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 2, 2019, 12:46 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks those of the
> task's resources to publish resources, so a failed resource provider
> would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Benjamin Bannier <be...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213337
-----------------------------------------------------------




src/slave/slave.cpp
Line 8777 (original), 8785 (patched)
<https://reviews.apache.org/r/70081/#comment299292>

    We potentially publish to multiple resource providers here where each could fail independently. Do we need to perform any cleanup if we were able to publish to all but one RP? By using `collect` (which is dictated by the return type) we seem to make it hard for the caller to perform such cleanup, but am unsure we can perform the cleanup here (it might also fail).
    
    I am missing something which makes this a no-issue or would we need to implement above `TODO` to make this work?


- Benjamin Bannier


On March 1, 2019, 9:11 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 1, 2019, 9:11 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks those of the
> task's resources to publish resources, so a failed resource provider
> would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Chun-Hung Hsiao <ch...@apache.org>.

> On March 4, 2019, 11:47 a.m., Benjamin Bannier wrote:
> > LGTM. Would be great if we could address https://reviews.apache.org/r/70081/#comment_rc299292-213391 before merging.

Done:
```
  // NOTE: Resource cleanups (e.g., unpublishing) are not performed at task
  // completion, but rather done __lazily__ when necessary. This is not just an
  // optimization but required because resource allocations are tied to task
  // lifecycles. As a result, no cleanup is needed here if any future fails.
```


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213392
-----------------------------------------------------------


On March 1, 2019, 11:46 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 1, 2019, 11:46 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks those of the
> task's resources to publish resources, so a failed resource provider
> would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Benjamin Bannier <be...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213392
-----------------------------------------------------------


Ship it!




LGTM. Would be great if we could address https://reviews.apache.org/r/70081/#comment_rc299292-213391 before merging.

- Benjamin Bannier


On March 2, 2019, 12:46 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 2, 2019, 12:46 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks those of the
> task's resources to publish resources, so a failed resource provider
> would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/
-----------------------------------------------------------

(Updated March 4, 2019, 7:48 p.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Addressed Benjamin's comment.


Bugs: MESOS-9607
    https://issues.apache.org/jira/browse/MESOS-9607


Repository: mesos


Description
-------

`Slave::publishResources` will no longer ask all resource providers to
publish all allocated resources. Instead, it only asks those of the
task's resources to publish resources, so a failed resource provider
would only fail tasks that want to use its resources.


Diffs (updated)
-----

  src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
  src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/70081/diff/4/

Changes: https://reviews.apache.org/r/70081/diff/3-4/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/
-----------------------------------------------------------

(Updated March 1, 2019, 11:46 p.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Rebased.


Bugs: MESOS-9607
    https://issues.apache.org/jira/browse/MESOS-9607


Repository: mesos


Description
-------

`Slave::publishResources` will no longer ask all resource providers to
publish all allocated resources. Instead, it only asks those of the
task's resources to publish resources, so a failed resource provider
would only fail tasks that want to use its resources.


Diffs (updated)
-----

  src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
  src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/70081/diff/3/

Changes: https://reviews.apache.org/r/70081/diff/2-3/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213338
-----------------------------------------------------------



Patch looks great!

Reviews applied: [70080, 70082, 70081]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On March 1, 2019, 8:11 p.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 1, 2019, 8:11 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks those of the
> task's resources to publish resources, so a failed resource provider
> would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/2/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Chun-Hung Hsiao <ch...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/
-----------------------------------------------------------

(Updated March 1, 2019, 8:11 p.m.)


Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.


Changes
-------

Moved the `id::UUID` changes out of this patch.


Bugs: MESOS-9607
    https://issues.apache.org/jira/browse/MESOS-9607


Repository: mesos


Description
-------

`Slave::publishResources` will no longer ask all resource providers to
publish all allocated resources. Instead, it only asks those of the
task's resources to publish resources, so a failed resource provider
would only fail tasks that want to use its resources.


Diffs (updated)
-----

  src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
  src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/70081/diff/2/

Changes: https://reviews.apache.org/r/70081/diff/1-2/


Testing
-------

make check


Thanks,

Chun-Hung Hsiao


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213326
-----------------------------------------------------------



Patch looks great!

Reviews applied: [70080, 70081]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On March 1, 2019, 5:52 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 1, 2019, 5:52 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks those of the
> task's resources to publish resources, so a failed resource provider
> would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>


Re: Review Request 70081: Do not fail a task if it doesn't use resources from a failed provider.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70081/#review213325
-----------------------------------------------------------



Patch looks great!

Reviews applied: [70080, 70081]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On March 1, 2019, 5:52 a.m., Chun-Hung Hsiao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70081/
> -----------------------------------------------------------
> 
> (Updated March 1, 2019, 5:52 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Jie Yu, and Jan Schlicht.
> 
> 
> Bugs: MESOS-9607
>     https://issues.apache.org/jira/browse/MESOS-9607
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> `Slave::publishResources` will no longer ask all resource providers to
> publish all allocated resources. Instead, it only asks those of the
> task's resources to publish resources, so a failed resource provider
> would only fail tasks that want to use its resources.
> 
> 
> Diffs
> -----
> 
>   src/resource_provider/manager.cpp 2cde62a1849b7d595841fb845033640b537b844d 
>   src/slave/slave.hpp 7ad495504e4ff144ac31812fbd4a3a1f4da86f02 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/70081/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>