Posted to reviews@aurora.apache.org by Jordan Ly <jo...@gmail.com> on 2018/03/07 05:50:35 UTC

Review Request 65941: Avoid scheduling on the same host the ancestor of a task recently failed on

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65941/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and Stephan Erb.


Repository: aurora


Description
-------

If a task fails on a host, we should avoid rescheduling it on that same host where possible, since the host may be bad. This issue generally comes up when you are bin-packing hosts (i.e. using the `-offer_order` option).

If there are no other offers to schedule the task on, we will still use the offer from the host it failed on.
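
For readers without access to the diff, the behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in the patch: the class `AvoidFailedHost`, its `pick` method, and the use of plain host-name strings in place of real offers are all assumptions made for the example.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Aurora. Offers are reduced to host names.
final class AvoidFailedHost {
  private AvoidFailedHost() {}

  // Returns the first offered host that differs from the host the ancestor
  // failed on. If every offer is on that host, falls back to the first offer
  // so the task can still be scheduled.
  static Optional<String> pick(List<String> offerHosts, Optional<String> failedHost) {
    for (String host : offerHosts) {
      if (!failedHost.isPresent() || !failedHost.get().equals(host)) {
        return Optional.of(host);
      }
    }
    // No alternative host was offered: use the failed host rather than
    // leaving the task unscheduled.
    return offerHosts.isEmpty() ? Optional.<String>empty() : Optional.of(offerHosts.get(0));
  }
}
```

The offer list is assumed to already be in the order the scheduler prefers, so the sketch only skips an offer; it does not reorder anything.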


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java fcafecf63040f9c410458dedfd3d87b0d669d205 
  src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 864538b6730d7318385494818276ba370124b8e9 


Diff: https://reviews.apache.org/r/65941/diff/1/


Testing
-------

`./gradlew test`

Benchmarks and live-cluster testing coming soon.


Thanks,

Jordan Ly


Re: Review Request 65941: Avoid scheduling on the same host the ancestor of a task recently failed on

Posted by Jordan Ly <jo...@gmail.com>.

> On March 7, 2018, 6:48 p.m., David McLaughlin wrote:
> > So what happens if there are two bad hosts? :)

This does not scale past n=1.

We can make this more generic by getting the list of hosts the task has previously failed on and looking through offers for a host the task did not fail on for some operator-defined value (something like `-failure_avoidance_factor`).
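
One possible reading of this, sketched below, is that the operator-defined value caps how many offers are examined before giving up and using the first one. That interpretation is an assumption, as are the names `AvoidFailedHosts`, `pick`, and `maxOffersToScan`; no such flag or class exists in Aurora.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Aurora. Offers are reduced to host names.
final class AvoidFailedHosts {
  private AvoidFailedHosts() {}

  // Scans at most maxOffersToScan offers for a host the task has not failed
  // on. If none is found within that budget, falls back to the first offer.
  static Optional<String> pick(
      List<String> offerHosts, Set<String> failedHosts, int maxOffersToScan) {
    int limit = Math.min(Math.max(maxOffersToScan, 0), offerHosts.size());
    for (int i = 0; i < limit; i++) {
      String host = offerHosts.get(i);
      if (!failedHosts.contains(host)) {
        return Optional.of(host);
      }
    }
    return offerHosts.isEmpty() ? Optional.<String>empty() : Optional.of(offerHosts.get(0));
  }
}
```

The set of failed hosts would have to come from whatever task history the scheduler still holds.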


- Jordan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65941/#review198803
-----------------------------------------------------------


Re: Review Request 65941: Avoid scheduling on the same host the ancestor of a task recently failed on

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.

> On March 7, 2018, 10:48 a.m., David McLaughlin wrote:
> > So what happens if there are two bad hosts? :)
> 
> Jordan Ly wrote:
>     This does not scale past n=1.
>     
>     We can make this more generic by getting the list of hosts the task has previously failed on and looking through offers for a host the task did not fail on for some operator-defined value (something like `-failure_avoidance_factor`).

Note that making this more generic is still contingent on the amount of task history we have on the scheduler.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65941/#review198803
-----------------------------------------------------------


Re: Review Request 65941: Avoid scheduling on the same host the ancestor of a task recently failed on

Posted by Stephan Erb <se...@apache.org>.

> On March 7, 2018, 7:48 p.m., David McLaughlin wrote:
> > So what happens if there are two bad hosts? :)
> 
> Jordan Ly wrote:
>     This does not scale past n=1.
>     
>     We can make this more generic by getting the list of hosts the task has previously failed on and looking through offers for a host the task did not fail on for some operator-defined value (something like `-failure_avoidance_factor`).
> 
> Santhosh Kumar Shanmugham wrote:
>     Note that making this more generic is still contingent on the amount of task history we have on the scheduler.
> 
> Jordan Ly wrote:
>     Discussed offline:
>     
>     Going to go a different route: this method is very domain-specific and does not allow preemption to kick in, since if only one host matches and it is bad, the task can still be repeatedly scheduled on it. Instead, going to go with a more generic solution that uses `SchedulingFilter` to temporarily ban scheduling on a host if the task fails on that host. This would be enabled through an operator-defined option.

Different idea: If the ancestor was LOST or FAILED, use a coin flip to decide whether to use a matching offer. This does not require additional state and gives the task sufficient chance to come up in one of the future scheduling rounds. As it would only be used for rescheduled tasks, it has no performance impact in the normal case.
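
A minimal sketch of that coin flip, to make the idea concrete. The class `CoinFlipVeto`, its `AncestorStatus` enum, and the `shouldUseOffer` method are hypothetical names for this example and do not correspond to Aurora code.

```java
import java.util.Random;

// Hypothetical helper, not part of Aurora.
final class CoinFlipVeto {
  // Simplified stand-in for the final status of the task's ancestor.
  enum AncestorStatus { NONE, FINISHED, FAILED, LOST }

  private CoinFlipVeto() {}

  // Tasks whose ancestor was FAILED or LOST accept a matching offer with
  // probability 1/2. All other tasks always accept, so the normal
  // scheduling path never touches the random source.
  static boolean shouldUseOffer(AncestorStatus ancestor, Random random) {
    boolean rescheduledAfterFailure =
        ancestor == AncestorStatus.FAILED || ancestor == AncestorStatus.LOST;
    return !rescheduledAfterFailure || random.nextBoolean();
  }
}
```

Passing the `Random` in keeps the decision testable. A rejected offer simply leaves the task pending until a later scheduling round, which is why no per-host state is needed.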


- Stephan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65941/#review198803
-----------------------------------------------------------


Re: Review Request 65941: Avoid scheduling on the same host the ancestor of a task recently failed on

Posted by Jordan Ly <jo...@gmail.com>.

> On March 7, 2018, 6:48 p.m., David McLaughlin wrote:
> > So what happens if there are two bad hosts? :)
> 
> Jordan Ly wrote:
>     This does not scale past n=1.
>     
>     We can make this more generic by getting the list of hosts the task has previously failed on and looking through offers for a host the task did not fail on for some operator-defined value (something like `-failure_avoidance_factor`).
> 
> Santhosh Kumar Shanmugham wrote:
>     Note that making this more generic is still contingent on the amount of task history we have on the scheduler.

Discussed offline:

Going to go a different route: this method is very domain-specific and does not allow preemption to kick in, since if only one host matches and it is bad, the task can still be repeatedly scheduled on it. Instead, going to go with a more generic solution that uses `SchedulingFilter` to temporarily ban scheduling on a host if the task fails on that host. This would be enabled through an operator-defined option.
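
Roughly, the temporary ban could look like the sketch below. It is deliberately standalone and does not implement Aurora's `SchedulingFilter` interface; the class `TemporaryHostBan`, its methods, the string task key, and the explicit millisecond clock argument are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Aurora. Not thread-safe.
final class TemporaryHostBan {
  private final long banMillis;
  // Maps "taskKey@host" to the time (in ms) at which the ban expires.
  private final Map<String, Long> banExpiry = new HashMap<>();

  TemporaryHostBan(long banMillis) {
    this.banMillis = banMillis;
  }

  private static String key(String taskKey, String host) {
    return taskKey + "@" + host;
  }

  // Called when the task fails on the host: starts (or restarts) the ban.
  void recordFailure(String taskKey, String host, long nowMillis) {
    banExpiry.put(key(taskKey, host), nowMillis + banMillis);
  }

  // Returns true while the task is still banned from the host. Expired
  // entries are dropped on lookup so the map does not grow without bound
  // for pairs that are queried again.
  boolean isVetoed(String taskKey, String host, long nowMillis) {
    String k = key(taskKey, host);
    Long expiry = banExpiry.get(k);
    if (expiry == null) {
      return false;
    }
    if (nowMillis >= expiry) {
      banExpiry.remove(k);
      return false;
    }
    return true;
  }
}
```

Unlike the original approach, there is no fallback here: while the ban is active the host is vetoed even if it is the only match, so the task stays pending and other mechanisms such as preemption get a chance to run.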


- Jordan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65941/#review198803
-----------------------------------------------------------


Re: Review Request 65941: Avoid scheduling on the same host the ancestor of a task recently failed on

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65941/#review198803
-----------------------------------------------------------



So what happens if there are two bad hosts? :)

- David McLaughlin


Re: Review Request 65941: Avoid scheduling on the same host the ancestor of a task recently failed on

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65941/#review198769
-----------------------------------------------------------


Ship it!




Master (a12b844) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot

