You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by Zameer Manji <zm...@apache.org> on 2017/06/06 19:41:26 UTC

Review Request 59853: Process rescinds in the same thread pool as offers.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.


Bugs: AURORA-1933
    https://issues.apache.org/jira/browse/AURORA-1933


Repository: aurora


Description
-------

In a a production environment I was able to observe the following:
```
I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
```

Notice that the offer rescind was processed before the actual offer. This is
possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
processed in the executor (to prevent blocking) and the rescind is handled
directly. This means the offer procecssing thread (SchedulerImpl-0) is racing
against the callback thread (Thread-77638).

In normal operation, there will be seconds to minutes between a rescind and an
offer, but in some cases an offer can be rescinded very quickly in clusters that
use oversubscription modules.

To fix this, we move the rescind processing into the same executor as the offer
processing to ensure they are processed in the order they are recived. Without
fixing this, the rescinded offer exists in the offer manager and can be used
later to launch a task. This task will immediately fail to launch because the
offer is invalid.

In this patch, I have also added a metric and logging to record when we fail to
remove an offer from the offer manager, and cleaned up the logging to allow
operators to see when an offer was recieved. With this logging, an operator can
grep for the offer id and see the entire lifecycle of the offer in the
scheduler.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
  src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 


Diff: https://reviews.apache.org/r/59853/diff/1/


Testing
-------


Thanks,

Zameer Manji


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review177086
-----------------------------------------------------------


Ship it!




Ship It!

- David McLaughlin


On June 6, 2017, 7:42 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 7:42 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters that
> use oversubscription modules.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are recived. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was recieved. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On June 28, 2017, 5:22 p.m., David McLaughlin wrote:
> > This change appears to have caused adverse behavior when we deployed it to production. We are seeing more tasks lost due to being launched in invalid offers.
> > 
> > I think we should revert this change until we root cause it.
> 
> Zameer Manji wrote:
>     This change caused less lost tasks when deployed to production for me. Have you isolated it 100% to this change? What do you see for the `offer_cancel_failures` metric?

Like 99.9% sure it's this commit. 

We'll revert this commit on one cluster and compare the differences.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review179131
-----------------------------------------------------------


On June 6, 2017, 8:56 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 8:56 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters that
> use oversubscription modules.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are received. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was recieved. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by Zameer Manji <zm...@apache.org>.

> On June 28, 2017, 10:22 a.m., David McLaughlin wrote:
> > This change appears to have caused adverse behavior when we deployed it to production. We are seeing more tasks lost due to being launched in invalid offers.
> > 
> > I think we should revert this change until we root cause it.

This change caused less lost tasks when deployed to production for me. Have you isolated it 100% to this change? What do you see for the `offer_cancel_failures` metric?


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review179131
-----------------------------------------------------------


On June 6, 2017, 1:56 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 1:56 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters that
> use oversubscription modules.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are received. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was recieved. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by Zameer Manji <zm...@apache.org>.

> On June 28, 2017, 10:22 a.m., David McLaughlin wrote:
> > This change appears to have caused adverse behavior when we deployed it to production. We are seeing more tasks lost due to being launched in invalid offers.
> > 
> > I think we should revert this change until we root cause it.
> 
> Zameer Manji wrote:
>     This change caused less lost tasks when deployed to production for me. Have you isolated it 100% to this change? What do you see for the `offer_cancel_failures` metric?
> 
> David McLaughlin wrote:
>     Like 99.9% sure it's this commit. 
>     
>     We'll revert this commit on one cluster and compare the differences.

Note that this code fixes a clear race condition, so if this is causing a problem in production I feel it is exposing another problem more so than causing another one.

I would like to repeat that the log snippet in the description shows that a rescind can be processed before the offer it self, which leaves an invalid offer id in the offer manager. This causes lost tasks with the reason of invalid offer. Reverting this change keeps you open to this problem.

One thing that you should do, is check the logs for offer ids that are rejected for being invalid and confirm the times that mesos sent the offer and send the rescind.

It is possible/likely that the rescind came right after the offer which exposes a problem in Mesos.

I think reverting this change is a bad idea based on the evidence that it is fixing a clear problem.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review179131
-----------------------------------------------------------


On June 6, 2017, 1:56 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 1:56 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters that
> use oversubscription modules.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are received. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was recieved. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review179131
-----------------------------------------------------------



This change appears to have caused adverse behavior when we deployed it to production. We are seeing more tasks lost due to being launched in invalid offers.

I think we should revert this change until we root cause it.

- David McLaughlin


On June 6, 2017, 8:56 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 8:56 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters that
> use oversubscription modules.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are received. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was recieved. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review177099
-----------------------------------------------------------


Ship it!




Ship It!

- Stephan Erb


On June 6, 2017, 10:56 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 10:56 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters that
> use oversubscription modules.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are received. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was recieved. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/
-----------------------------------------------------------

(Updated June 6, 2017, 1:56 p.m.)


Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.


Bugs: AURORA-1933
    https://issues.apache.org/jira/browse/AURORA-1933


Repository: aurora


Description (updated)
-------

In a a production environment I was able to observe the following:
```
I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
```

Notice that the offer rescind was processed before the actual offer. This is
possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
processed in the executor (to prevent blocking) and the rescind is handled
directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
against the callback thread (`Thread-77638`).

In normal operation, there will be seconds to minutes between a rescind and an
offer, but in some cases an offer can be rescinded very quickly in clusters that
use oversubscription modules.

To fix this, we move the rescind processing into the same executor as the offer
processing to ensure they are processed in the order they are received. Without
fixing this, the rescinded offer exists in the offer manager and can be used
later to launch a task. This task will immediately fail to launch because the
offer is invalid.

In this patch, I have also added a metric and logging to record when we fail to
remove an offer from the offer manager, and cleaned up the logging to allow
operators to see when an offer was recieved. With this logging, an operator can
grep for the offer id and see the entire lifecycle of the offer in the
scheduler.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
  src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 


Diff: https://reviews.apache.org/r/59853/diff/1/


Testing
-------


Thanks,

Zameer Manji


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review177079
-----------------------------------------------------------


Ship it!




Master (2cbaeec) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On June 6, 2017, 7:42 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 7:42 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters that
> use oversubscription modules.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are recived. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was recieved. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review177080
-----------------------------------------------------------


Ship it!




- Santhosh Kumar Shanmugham


On June 6, 2017, 12:42 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 12:42 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between a rescind and an
> offer, but in some cases an offer can be rescinded very quickly in clusters that
> use oversubscription modules.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are recived. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was recieved. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>


Re: Review Request 59853: Process rescinds in the same thread pool as offers.

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/
-----------------------------------------------------------

(Updated June 6, 2017, 12:42 p.m.)


Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.


Bugs: AURORA-1933
    https://issues.apache.org/jira/browse/AURORA-1933


Repository: aurora


Description (updated)
-------

In a a production environment I was able to observe the following:
```
I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
```

Notice that the offer rescind was processed before the actual offer. This is
possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
processed in the executor (to prevent blocking) and the rescind is handled
directly. This means the offer procecssing thread (`SchedulerImpl-0`) is racing
against the callback thread (`Thread-77638`).

In normal operation, there will be seconds to minutes between a rescind and an
offer, but in some cases an offer can be rescinded very quickly in clusters that
use oversubscription modules.

To fix this, we move the rescind processing into the same executor as the offer
processing to ensure they are processed in the order they are recived. Without
fixing this, the rescinded offer exists in the offer manager and can be used
later to launch a task. This task will immediately fail to launch because the
offer is invalid.

In this patch, I have also added a metric and logging to record when we fail to
remove an offer from the offer manager, and cleaned up the logging to allow
operators to see when an offer was recieved. With this logging, an operator can
grep for the offer id and see the entire lifecycle of the offer in the
scheduler.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c 
  src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884 


Diff: https://reviews.apache.org/r/59853/diff/1/


Testing
-------


Thanks,

Zameer Manji