You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by David McLaughlin <da...@dmclaughlin.com> on 2017/05/23 07:41:19 UTC

Review Request 59480: Expose bin-packing options via OfferManager ordering.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/
-----------------------------------------------------------

Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.


Repository: aurora


Description
-------

This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 

The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 

The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.

* - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.


Diffs
-----

  RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
  src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
  src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
  src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
  src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 


Diff: https://reviews.apache.org/r/59480/diff/1/


Testing
-------

This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 

There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.


Thanks,

David McLaughlin


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On May 24, 2017, 7:47 p.m., Jordan Ly wrote:
> > src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java
> > Lines 308 (patched)
> > <https://reviews.apache.org/r/59480/diff/1/?file=1727627#file1727627line308>
> >
> >     Not sure how practical this is but it would be nice for users to specify arbitrary order preference. For example, we have CPU_MEMORY right now but what if wanted DISK_CPU?
> >     
> >     If we could input a list like [CPU, DISK] in the command line for preferred ordering that would be ideal.

Thanks, I implemented this suggestion.


> On May 24, 2017, 7:47 p.m., Jordan Ly wrote:
> > src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java
> > Lines 312 (patched)
> > <https://reviews.apache.org/r/59480/diff/1/?file=1727627#file1727627line312>
> >
> >     Do you also need to compound the RANDOM_COMPARATOR TO this (in case two machines have the same cpu/RAM amounts)?

Now user would add RANDOM if they want that (without it, will default to FIFO on tie breaks).


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review175978
-----------------------------------------------------------


On May 31, 2017, 9:41 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 31, 2017, 9:41 p.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/2/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review175978
-----------------------------------------------------------



This is awesome!


src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java
Lines 308 (patched)
<https://reviews.apache.org/r/59480/#comment249327>

    Not sure how practical this is but it would be nice for users to specify arbitrary order preference. For example, we have CPU_MEMORY right now but what if wanted DISK_CPU?
    
    If we could input a list like [CPU, DISK] in the command line for preferred ordering that would be ideal.



src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java
Lines 312 (patched)
<https://reviews.apache.org/r/59480/#comment249326>

    Do you also need to compound the RANDOM_COMPARATOR TO this (in case two machines have the same cpu/RAM amounts)?


- Jordan Ly


On May 23, 2017, 7:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 7:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review175771
-----------------------------------------------------------


Ship it!




Master (4c0974b) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On May 23, 2017, 7:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 7:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review175980
-----------------------------------------------------------




src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java
Lines 308-314 (patched)
<https://reviews.apache.org/r/59480/#comment249328>

    Probably compund RANDOM at the end of the switch to avoid missing randomization.


- Santhosh Kumar Shanmugham


On May 23, 2017, 12:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 12:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On May 24, 2017, 7:45 p.m., Santhosh Kumar Shanmugham wrote:
> > Only down-side to this approach is that the set of offers (hence agents) that a task might get assigned to is effectively reduced and hence the probability of a task landing on the same broken host increases. Assuming a healthy cluster (with good distribution of slots of different sizes) this might problem might never surface.
> > 
> > Ship it!

This is a great point.  

We don't plan to use this directly for other reasons, but I'll need to clean up the wording here. 

To be clear to anyone who is interested in this feature: I actually wouldn't recommend using any type of stable sort of offers in production with the FirstFitTaskAssigner yet - as all scheduling work will be biased to that bad host as tasks repeatedly fail and Mesos reoffers them. We'd need some sort of failure accural detection mechanism in Aurora (or Mesos) to blacklist bad agents before using this confidently. 

Our plan at Twitter was to use this approach to make score-based scheduling more scalable.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review175977
-----------------------------------------------------------


On May 23, 2017, 7:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 7:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review175977
-----------------------------------------------------------


Ship it!




Only down-side to this approach is that the set of offers (hence agents) that a task might get assigned to is effectively reduced and hence the probability of a task landing on the same broken host increases. Assuming a healthy cluster (with good distribution of slots of different sizes) this might problem might never surface.

Ship it!

- Santhosh Kumar Shanmugham


On May 23, 2017, 12:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 12:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Reza Motamedi <re...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review175996
-----------------------------------------------------------


Ship it!




Over all looks good to me. I am not sure if it's easy to add costomizability by provinding an ordering at run time!


src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java
Lines 24 (patched)
<https://reviews.apache.org/r/59480/#comment249341>

    I just want to understand what the plan is in the long term. Do we need to add to this in case some other resource becomes tight? What happens when there are new resources such as GPU?


- Reza Motamedi


On May 23, 2017, 7:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 7:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Stephan Erb <se...@apache.org>.

> On May 26, 2017, 10:23 a.m., Stephan Erb wrote:
> > I have thought about this in the past as well, and I am slightly skeptical if a simple sort by resources is a good approach when using oversubscription. Especially given the goal to run almost all (non-production) tasks as revocable.
> > 
> > Offers do contain both revocable and non-revocable resources, so we would need to find an ordering that is suitable for both:
> > 
> > * Sort by non-revocable resources first: We will pack the production tasks tightly but those are now pretty simple to schedule anyway, as most stuff is running as revocable. As the corresponding downside we don't reduce fragmentation for revocable resources.
> > 
> > * Sorting by revocable resources: offers with the fewest revocable resources will be at the front. Given the nature of the oversubscription, these will be the nodes with the highest load (as those are capped to 0 revocable resources). If we schedule further non-revocable tasks on there, we will either get bad performance or trigger the termination of revocable tasks (depending on our `Estimator` and `QoSController`).
> > 
> > 
> > Alternative first-fit solutions that come to mind (partially inspired by http://shonan.nii.ac.jp/shonan/seminar011/files/2012/02/stein.pdf)
> > 
> > * Sort by agent id. Over time, this should converge to the same cluster state as sorting offers by size. The upside is 
> >   that it would work for revocable and non-revocable resources simultaneously.
> > 
> > * Compute a score per offer and then schedule on the node with the smallerst offer. As an approximation
> >   we could use something like `10 ** (free_ram/total_ram) + 10 ** (free_revocable_ram/total_revocable_ram) ...`. 
> > 
> > * Extend the sort by revocable resources with a special case for 0 remaining revocable resources.
> 
> David McLaughlin wrote:
>     I guess my main question here is: why wouldn't your alternative orderings not just be submitted as patches?
>     
>     
>     
>     Sorting offers is definitely not a panacea and I didn't intend to suggest that. Please see my reply to Santhosh too. I will clean up the README/docs if we decide to ship this.  
>     
>     
>     To clarify further: 
>     
>     What this patch is trying to enable is the ability to influence global ordering that is better than random. So it's basically a chance to say "this is what I would prefer to happen for _all_ tasks in the cluster absent of any other information". E.g. at Twitter CPU is by far our scarcest resource and any bin-packing solution would be based almost solely on that. 
>     
>     But what you really want is a score-based approach that makes smart decisions _per-task_. This is sort of what constraints are, except for bin-packing we don't want to avoid scheduling things when ideal circumstances cannot be found, but instead make sure we pick the best of a bad bunch. The problem is performance. 
>     
>     Our steady-state at Twitter is about 300 calls to TaskAssigner.maybeAssign per minute - this is your crons, failing tasks and other recurring scheduling load. When engineers start deploying, we see that number increase to around 1~2k per minute. 
>     
>     The current algorithm has a worst-case complexity of O(o·n) where o = number of offers and n = number of assign attempts. (Let's assume the filtering done per offer is a constant cost). Worst-case complexity is met when none of the offers match a task which isn't all that uncommon. With 2k attempts and 40k offers, this is already getting close to being a scheduling bottleneck for us, e.g. http://i.imgur.com/C9O8BN8.png.
>     
>     
>     A score-based approach that ranks all offers based on what is ideal for the current task is much more complex. You need to build a sorted structure of all offers for each maybeAssign attempt. So you now have something like a O(o·log(o) · n) algorithm. You can optimize n by increasing the batch size, etc. but there's an upper bound on this (and a bunch of other negative trade-offs). So the only other approach is to limit o, either by maintaining smaller clusters (which is recommended in the Borg paper) or by sampling, where you score a fixed-size number of of offers rather than the entire set. 
>     
>     We're still considering both approaches to reducing the number of offers. But with the latter approach of scoring only a subset of matching offers, what we've ran into is that with a random ordering of all offers (e.g. OfferManager.getOffers on master), the % of improvement in bin-packing vs first-fit and random is completely negligible without blowing up performance by increasing the sample size. One of the big problems is that as soon as a task is scheduled, the previous filtering decision on all offers are now potentially invalidated due to things like host and limit constraints. But when I added a CPU sort on the offers, the results were dramatically improved - only a couple % points away from a perfect bin-packing simulation. So that's the main motivation for this patch. 
>     
>     Unfortunately we're stuck in the short term on this single-writer single-reader architecture for Aurora, so we can't even fan out this workload by increasing number of replicas, etc.

Thanks for all the details! Is the 90% improvement you have stated in the description of the patch independent of the score-based approach? Or does it already require scoring?

I am OK with getting this patch ,erged. We should however make sure it is clear that operators understand the potential downsides (e.g. constant rescheduling on broken nodes).

I would love to see that this is also working for revocable resources, but I can look into that in a follow-up patch :)


- Stephan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176057
-----------------------------------------------------------


On May 23, 2017, 9:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 9:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On May 26, 2017, 8:23 a.m., Stephan Erb wrote:
> > I have thought about this in the past as well, and I am slightly skeptical if a simple sort by resources is a good approach when using oversubscription. Especially given the goal to run almost all (non-production) tasks as revocable.
> > 
> > Offers do contain both revocable and non-revocable resources, so we would need to find an ordering that is suitable for both:
> > 
> > * Sort by non-revocable resources first: We will pack the production tasks tightly but those are now pretty simple to schedule anyway, as most stuff is running as revocable. As the corresponding downside we don't reduce fragmentation for revocable resources.
> > 
> > * Sorting by revocable resources: offers with the fewest revocable resources will be at the front. Given the nature of the oversubscription, these will be the nodes with the highest load (as those are capped to 0 revocable resources). If we schedule further non-revocable tasks on there, we will either get bad performance or trigger the termination of revocable tasks (depending on our `Estimator` and `QoSController`).
> > 
> > 
> > Alternative first-fit solutions that come to mind (partially inspired by http://shonan.nii.ac.jp/shonan/seminar011/files/2012/02/stein.pdf)
> > 
> > * Sort by agent id. Over time, this should converge to the same cluster state as sorting offers by size. The upside is 
> >   that it would work for revocable and non-revocable resources simultaneously.
> > 
> > * Compute a score per offer and then schedule on the node with the smallerst offer. As an approximation
> >   we could use something like `10 ** (free_ram/total_ram) + 10 ** (free_revocable_ram/total_revocable_ram) ...`. 
> > 
> > * Extend the sort by revocable resources with a special case for 0 remaining revocable resources.

I guess my main question here is: why wouldn't your alternative orderings not just be submitted as patches?


Sorting offers is definitely not a panacea and I didn't intend to suggest that. Please see my reply to Santhosh too. I will clean up the README/docs if we decide to ship this.  


To clarify further: 

What this patch is trying to enable is the ability to influence global ordering that is better than random. So it's basically a chance to say "this is what I would prefer to happen for _all_ tasks in the cluster absent of any other information". E.g. at Twitter CPU is by far our scarcest resource and any bin-packing solution would be based almost solely on that. 

But what you really want is a score-based approach that makes smart decisions _per-task_. This is sort of what constraints are, except for bin-packing we don't want to avoid scheduling things when ideal circumstances cannot be found, but instead make sure we pick the best of a bad bunch. The problem is performance. 

Our steady-state at Twitter is about 300 calls to TaskAssigner.maybeAssign per minute - this is your crons, failing tasks and other recurring scheduling load. When engineers start deploying, we see that number increase to around 1~2k per minute. 

The current algorithm has a worst-case complexity of O(o·n) where o = number of offers and n = number of assign attempts. (Let's assume the filtering done per offer is a constant cost). Worst-case complexity is met when none of the offers match a task which isn't all that uncommon. With 2k attempts and 40k offers, this is already getting close to being a scheduling bottleneck for us, e.g. http://i.imgur.com/C9O8BN8.png.


A score-based approach that ranks all offers based on what is ideal for the current task is much more complex. You need to build a sorted structure of all offers for each maybeAssign attempt. So you now have something like a O(o·log(o) · n) algorithm. You can optimize n by increasing the batch size, etc. but there's an upper bound on this (and a bunch of other negative trade-offs). So the only other approach is to limit o, either by maintaining smaller clusters (which is recommended in the Borg paper) or by sampling, where you score a fixed-size number of of offers rather than the entire set. 

We're still considering both approaches to reducing the number of offers. But with the latter approach of scoring only a subset of matching offers, what we've ran into is that with a random ordering of all offers (e.g. OfferManager.getOffers on master), the % of improvement in bin-packing vs first-fit and random is completely negligible without blowing up performance by increasing the sample size. One of the big problems is that as soon as a task is scheduled, the previous filtering decision on all offers are now potentially invalidated due to things like host and limit constraints. But when I added a CPU sort on the offers, the results were dramatically improved - only a couple % points away from a perfect bin-packing simulation. So that's the main motivation for this patch. 

Unfortunately we're stuck in the short term on this single-writer single-reader architecture for Aurora, so we can't even fan out this workload by increasing number of replicas, etc.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176057
-----------------------------------------------------------


On May 23, 2017, 7:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 7:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On May 26, 2017, 8:23 a.m., Stephan Erb wrote:
> > I have thought about this in the past as well, and I am slightly skeptical if a simple sort by resources is a good approach when using oversubscription. Especially given the goal to run almost all (non-production) tasks as revocable.
> > 
> > Offers do contain both revocable and non-revocable resources, so we would need to find an ordering that is suitable for both:
> > 
> > * Sort by non-revocable resources first: We will pack the production tasks tightly but those are now pretty simple to schedule anyway, as most stuff is running as revocable. As the corresponding downside we don't reduce fragmentation for revocable resources.
> > 
> > * Sorting by revocable resources: offers with the fewest revocable resources will be at the front. Given the nature of the oversubscription, these will be the nodes with the highest load (as those are capped to 0 revocable resources). If we schedule further non-revocable tasks on there, we will either get bad performance or trigger the termination of revocable tasks (depending on our `Estimator` and `QoSController`).
> > 
> > 
> > Alternative first-fit solutions that come to mind (partially inspired by http://shonan.nii.ac.jp/shonan/seminar011/files/2012/02/stein.pdf)
> > 
> > * Sort by agent id. Over time, this should converge to the same cluster state as sorting offers by size. The upside is 
> >   that it would work for revocable and non-revocable resources simultaneously.
> > 
> > * Compute a score per offer and then schedule on the node with the smallerst offer. As an approximation
> >   we could use something like `10 ** (free_ram/total_ram) + 10 ** (free_revocable_ram/total_revocable_ram) ...`. 
> > 
> > * Extend the sort by revocable resources with a special case for 0 remaining revocable resources.
> 
> David McLaughlin wrote:
>     I guess my main question here is: why wouldn't your alternative orderings not just be submitted as patches?
>     
>     
>     
>     Sorting offers is definitely not a panacea and I didn't intend to suggest that. Please see my reply to Santhosh too. I will clean up the README/docs if we decide to ship this.  
>     
>     
>     To clarify further: 
>     
>     What this patch is trying to enable is the ability to influence global ordering that is better than random. So it's basically a chance to say "this is what I would prefer to happen for _all_ tasks in the cluster absent of any other information". E.g. at Twitter CPU is by far our scarcest resource and any bin-packing solution would be based almost solely on that. 
>     
>     But what you really want is a score-based approach that makes smart decisions _per-task_. This is sort of what constraints are, except for bin-packing we don't want to avoid scheduling things when ideal circumstances cannot be found, but instead make sure we pick the best of a bad bunch. The problem is performance. 
>     
>     Our steady-state at Twitter is about 300 calls to TaskAssigner.maybeAssign per minute - this is your crons, failing tasks and other recurring scheduling load. When engineers start deploying, we see that number increase to around 1~2k per minute. 
>     
>     The current algorithm has a worst-case complexity of O(o·n) where o = number of offers and n = number of assign attempts. (Let's assume the filtering done per offer is a constant cost). Worst-case complexity is met when none of the offers match a task which isn't all that uncommon. With 2k attempts and 40k offers, this is already getting close to being a scheduling bottleneck for us, e.g. http://i.imgur.com/C9O8BN8.png.
>     
>     
>     A score-based approach that ranks all offers based on what is ideal for the current task is much more complex. You need to build a sorted structure of all offers for each maybeAssign attempt. So you now have something like a O(o·log(o) · n) algorithm. You can optimize n by increasing the batch size, etc. but there's an upper bound on this (and a bunch of other negative trade-offs). So the only other approach is to limit o, either by maintaining smaller clusters (which is recommended in the Borg paper) or by sampling, where you score a fixed-size number of of offers rather than the entire set. 
>     
>     We're still considering both approaches to reducing the number of offers. But with the latter approach of scoring only a subset of matching offers, what we've ran into is that with a random ordering of all offers (e.g. OfferManager.getOffers on master), the % of improvement in bin-packing vs first-fit and random is completely negligible without blowing up performance by increasing the sample size. One of the big problems is that as soon as a task is scheduled, the previous filtering decision on all offers are now potentially invalidated due to things like host and limit constraints. But when I added a CPU sort on the offers, the results were dramatically improved - only a couple % points away from a perfect bin-packing simulation. So that's the main motivation for this patch. 
>     
>     Unfortunately we're stuck in the short term on this single-writer single-reader architecture for Aurora, so we can't even fan out this workload by increasing number of replicas, etc.
> 
> Stephan Erb wrote:
>     Thanks for all the details! Is the 90% improvement you have stated in the description of the patch independent of the score-based approach? Or does it already require scoring?
>     
>     I am OK with getting this patch ,erged. We should however make sure it is clear that operators understand the potential downsides (e.g. constant rescheduling on broken nodes).
>     
>     I would love to see that this is also working for revocable resources, but I can look into that in a follow-up patch :)

Yeah, the improvement was with this patch alone. 

I split resources into revocable/non-revocable when ordering, and added a new REVOCABLE_CPU order. RAM/DISK are not shareable resources so revocable and non-revocable seem to be treated the same.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176057
-----------------------------------------------------------


On May 31, 2017, 9:41 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 31, 2017, 9:41 p.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/2/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176057
-----------------------------------------------------------



I have thought about this in the past as well, and I am slightly skeptical if a simple sort by resources is a good approach when using oversubscription. Especially given the goal to run almost all (non-production) tasks as revocable.

Offers do contain both revocable and non-revocable resources, so we would need to find an ordering that is suitable for both:

* Sort by non-revocable resources first: We will pack the production tasks tightly but those are now pretty simple to schedule anyway, as most stuff is running as revocable. As the corresponding downside we don't reduce fragmentation for revocable resources.

* Sorting by revocable resources: offers with the fewest revocable resources will be at the front. Given the nature of the oversubscription, these will be the nodes with the highest load (as those are capped to 0 revocable resources). If we schedule further non-revocable tasks on there, we will either get bad performance or trigger the termination of revocable tasks (depending on our `Estimator` and `QoSController`).


Alternative first-fit solutions that come to mind (partially inspired by http://shonan.nii.ac.jp/shonan/seminar011/files/2012/02/stein.pdf)

* Sort by agent id. Over time, this should converge to the same cluster state as sorting offers by size. The upside is 
  that it would work for revocable and non-revocable resources simultaneously.

* Compute a score per offer and then schedule on the node with the smallerst offer. As an approximation
  we could use something like `10 ** (free_ram/total_ram) + 10 ** (free_revocable_ram/total_revocable_ram) ...`. 

* Extend the sort by revocable resources with a special case for 0 remaining revocable resources.

- Stephan Erb


On May 23, 2017, 9:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 9:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176540
-----------------------------------------------------------


Ship it!




Master (d7425aa) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On May 31, 2017, 9:41 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 31, 2017, 9:41 p.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/2/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176554
-----------------------------------------------------------


Ship it!




Master (d7425aa) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On May 31, 2017, 11:28 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 31, 2017, 11:28 p.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/4/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/
-----------------------------------------------------------

(Updated May 31, 2017, 11:28 p.m.)


Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.


Changes
-------

Added comment for revocable cpu ordering.


Repository: aurora


Description
-------

This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 

The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 

The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.

* - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.


Diffs (updated)
-----

  RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
  src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
  src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
  src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
  src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 


Diff: https://reviews.apache.org/r/59480/diff/4/

Changes: https://reviews.apache.org/r/59480/diff/3-4/


Testing
-------

This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 

There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.


Thanks,

David McLaughlin


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On May 31, 2017, 10:50 p.m., Stephan Erb wrote:
> > RELEASE-NOTES.md
> > Lines 32 (patched)
> > <https://reviews.apache.org/r/59480/diff/2/?file=1735991#file1735991line32>
> >
> >     This is now slightly out of date.

Fixed.


> On May 31, 2017, 10:50 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java
> > Lines 79 (patched)
> > <https://reviews.apache.org/r/59480/diff/2/?file=1735995#file1735995line79>
> >
> >     This might need a comment to be understandable for a future reader.

Added.


> On May 31, 2017, 10:50 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java
> > Lines 90 (patched)
> > <https://reviews.apache.org/r/59480/diff/2/?file=1735995#file1735995line90>
> >
> >     REVOCABLE_MEMORY is missing in this switch statement.

Removed REVOCABLE_MEMORY entirely.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176542
-----------------------------------------------------------


On May 31, 2017, 11:28 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 31, 2017, 11:28 p.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/4/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176542
-----------------------------------------------------------


Ship it!




Pretty awesome. Thanks a lot!


RELEASE-NOTES.md
Lines 32 (patched)
<https://reviews.apache.org/r/59480/#comment249930>

    This is now slightly out of date.



src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java
Lines 79 (patched)
<https://reviews.apache.org/r/59480/#comment249929>

    This might need a comment to be understandable for a future reader.



src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java
Lines 90 (patched)
<https://reviews.apache.org/r/59480/#comment249931>

    REVOCABLE_MEMORY is missing in this switch statement.


- Stephan Erb


On June 1, 2017, 12:36 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated June 1, 2017, 12:36 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/3/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 
> 
> There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/
-----------------------------------------------------------

(Updated May 31, 2017, 10:36 p.m.)


Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.


Repository: aurora


Description
-------

This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 

The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 

The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.

* - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.


Diffs (updated)
-----

  RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
  src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
  src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
  src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
  src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 


Diff: https://reviews.apache.org/r/59480/diff/3/

Changes: https://reviews.apache.org/r/59480/diff/2-3/


Testing
-------

This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 

There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.


Thanks,

David McLaughlin


Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/
-----------------------------------------------------------

(Updated May 31, 2017, 9:41 p.m.)


Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.


Changes
-------

Added support for configurable compositions of ordering: e.g. CPU,MEMORY instead of CPU_MEMORY. 

Added support for revocable resources.


Repository: aurora


Description
-------

This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument. 

The bin-packing is only an approximation, but can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%. 

The approximation is based on the simple idea that you have the best chance of having perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8 core and 12 core offer available - you'd always want to put the task in the 8 core offer*. By sorting offers in OfferManager during iteration, then a first-fit algorithm is guaranteed to match the smallest possible offer for your task and achieves this.

* - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.


Diffs (updated)
-----

  RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
  src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
  src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
  src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385 
  src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be 


Diff: https://reviews.apache.org/r/59480/diff/2/

Changes: https://reviews.apache.org/r/59480/diff/1-2/


Testing
-------

This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute. 

There is an  overhead when scheduling tasks that have large resource requirements - as the task assigner will first need to skip offer all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.


Thanks,

David McLaughlin