You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by Jordan Ly <jo...@gmail.com> on 2017/08/22 00:37:29 UTC

Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.


Bugs: AURORA-1945
    https://issues.apache.org/jira/browse/AURORA-1945


Repository: aurora


Description
-------

The current race condition for offers is possible:
```
1. Scheduler receives an offer and adds it to the executor queue for processing.
2. The executor processes the offer and adds it to the HostOffers list.
3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
```
The following logs show this in action:

Mesos:
```
I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
```
Aurora:
```
I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
``` 
I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected.


Diffs
-----

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
  src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
  src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 


Diff: https://reviews.apache.org/r/61804/diff/1/


Testing
-------

`./gradlew test`

Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.

I will verify this patch on a live cluster as well before submitting.


Thanks,

Jordan Ly


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.

> On Aug. 21, 2017, 9:33 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
> > Line 237 (original), 247-248 (patched)
> > <https://reviews.apache.org/r/61804/diff/2/?file=1801326#file1801326line252>
> >
> >     Maybe add a metric to track the unban-s?
> >     
> >     Do we really need to `unban` the offer before `cancelling` it? Can we fold `unbanOffer` into the `cancelOffer` method?
> 
> Jordan Ly wrote:
>     Added metrics to check bans and unbans.
>     
>     I would prefer not to fold `unbanOffer` into the `cancelOffer` method as I believe they are distinct enough to leave them as separate calls. I can see use cases where a developer would call one but not the other.
> 
> David McLaughlin wrote:
>     Is cancelOffer currently being called anywhere were we wouldn't want to then ban it if it returns false?
> 
> Santhosh Kumar Shanmugham wrote:
>     I do not think we should be developing code for future usecases that may or may not arise.
>     
>     I see `cancelOffer` as the equivalent of clearing a particular offer. In that case we should clean up all the state related to that particular offer. Leaving it in the banned set means, we are expecting developers to do the clean up themselves.
> 
> Jordan Ly wrote:
>     That makes sense. Is it possible for an offer to not be sent to the framework, but a rescind to be sent? In this case, I don't think cancelOffer would have enough signal to determine whether to ban or unban an offer, it would return false both times as the offer will never be in HostOffers.
> 
> Jordan Ly wrote:
>     Please ignore the above, RB posted my unrevised comment :(
>     
>     What I originally meant to post:
>     ```
>     I do not see it as developing potentially unneeded code but rather maintaining the abstraction around OfferManager. I do not think a developer would know to call `cancelOffer` twice in order to undo some hidden state for a particular implementation -- I would rather make it explicit. On the other hand, this could also mean that the abstraction I am looking to create is flawed and can be done better/simpler.
>     
>     I definitely understand where you are coming from. We do not want to add superfluous code that future developers must think about and maintain. Is it possible for an offer to not be sent to the framework, but a rescind to be sent? In this case, I don't think cancelOffer would have enough signal to determine whether to ban or unban an offer, it would return false both times as the offer will never be in HostOffers. However, if not, it is definitely possible to do this refactor.
>     ```

We are not relying on the return value during the second `cancelOffer` call.

We already have `banOfferForTaskGroup` without a corresponding `unbanOfferForTaskGroup` and we are now setting a different pattern.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183442
-----------------------------------------------------------


On Aug. 22, 2017, 1:33 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 1:33 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/4/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Jordan Ly <jo...@gmail.com>.

> On Aug. 22, 2017, 4:33 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
> > Line 237 (original), 247-248 (patched)
> > <https://reviews.apache.org/r/61804/diff/2/?file=1801326#file1801326line252>
> >
> >     Maybe add a metric to track the unban-s?
> >     
> >     Do we really need to `unban` the offer before `cancelling` it? Can we fold `unbanOffer` into the `cancelOffer` method?
> 
> Jordan Ly wrote:
>     Added metrics to check bans and unbans.
>     
>     I would prefer not to fold `unbanOffer` into the `cancelOffer` method as I believe they are distinct enough to leave them as separate calls. I can see use cases where a developer would call one but not the other.
> 
> David McLaughlin wrote:
>     Is cancelOffer currently being called anywhere were we wouldn't want to then ban it if it returns false?
> 
> Santhosh Kumar Shanmugham wrote:
>     I do not think we should be developing code for future usecases that may or may not arise.
>     
>     I see `cancelOffer` as the equivalent of clearing a particular offer. In that case we should clean up all the state related to that particular offer. Leaving it in the banned set means, we are expecting developers to do the clean up themselves.

That makes sense. Is it possible for an offer to not be sent to the framework, but a rescind to be sent? In this case, I don't think cancelOffer would have enough signal to determine whether to ban or unban an offer, it would return false both times as the offer will never be in HostOffers.


- Jordan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183442
-----------------------------------------------------------


On Aug. 22, 2017, 5:05 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 5:05 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/3/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Aug. 22, 2017, 4:33 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
> > Line 237 (original), 247-248 (patched)
> > <https://reviews.apache.org/r/61804/diff/2/?file=1801326#file1801326line252>
> >
> >     Maybe add a metric to track the unban-s?
> >     
> >     Do we really need to `unban` the offer before `cancelling` it? Can we fold `unbanOffer` into the `cancelOffer` method?
> 
> Jordan Ly wrote:
>     Added metrics to check bans and unbans.
>     
>     I would prefer not to fold `unbanOffer` into the `cancelOffer` method as I believe they are distinct enough to leave them as separate calls. I can see use cases where a developer would call one but not the other.

Is cancelOffer currently being called anywhere were we wouldn't want to then ban it if it returns false?


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183442
-----------------------------------------------------------


On Aug. 22, 2017, 5:05 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 5:05 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/3/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.

> On Aug. 21, 2017, 9:33 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
> > Line 237 (original), 247-248 (patched)
> > <https://reviews.apache.org/r/61804/diff/2/?file=1801326#file1801326line252>
> >
> >     Maybe add a metric to track the unban-s?
> >     
> >     Do we really need to `unban` the offer before `cancelling` it? Can we fold `unbanOffer` into the `cancelOffer` method?
> 
> Jordan Ly wrote:
>     Added metrics to check bans and unbans.
>     
>     I would prefer not to fold `unbanOffer` into the `cancelOffer` method as I believe they are distinct enough to leave them as separate calls. I can see use cases where a developer would call one but not the other.
> 
> David McLaughlin wrote:
>     Is cancelOffer currently being called anywhere were we wouldn't want to then ban it if it returns false?

I do not think we should be developing code for future usecases that may or may not arise.

I see `cancelOffer` as the equivalent of clearing a particular offer. In that case we should clean up all the state related to that particular offer. Leaving it in the banned set means, we are expecting developers to do the clean up themselves.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183442
-----------------------------------------------------------


On Aug. 22, 2017, 10:05 a.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 10:05 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/3/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183442
-----------------------------------------------------------



The solution looks good to me. Some simplification suggested.


src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
Lines 235 (patched)
<https://reviews.apache.org/r/61804/#comment259446>

    s/hostOffers/offerManager/
    
    Best not to talk about `HostOffers` which is an internal datastructure for `OfferManager` and we should respect the encapsulation in the comments.



src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
Lines 241 (patched)
<https://reviews.apache.org/r/61804/#comment259445>

    nit - s/Canceled/Cancelled/



src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java
Line 237 (original), 247-248 (patched)
<https://reviews.apache.org/r/61804/#comment259447>

    Maybe add a metric to track the unban-s?
    
    Do we really need to `unban` the offer before `cancelling` it? Can we fold `unbanOffer` into the `cancelOffer` method?


- Santhosh Kumar Shanmugham


On Aug. 21, 2017, 5:38 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 21, 2017, 5:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/2/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Zameer Manji <zm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183436
-----------------------------------------------------------



I don't have the bandwidth to review this, but this fix seems to be a clever way of fixing this problem and far better than my previous approach.

- Zameer Manji


On Aug. 21, 2017, 5:38 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 21, 2017, 5:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/2/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183421
-----------------------------------------------------------


Ship it!




Master (aae2b0d) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Aug. 22, 2017, 12:38 a.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 12:38 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/2/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183414
-----------------------------------------------------------



Master (aae2b0d) is red with this patch.
  ./build-support/jenkins/build.sh

:generateBuildProperties
:processResources
:classes
:jar
:startScripts
:distTar
:distZip
:assemble
:compileJmhJavaNote: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/jmh/java/org/apache/aurora/benchmark/fakes/FakeSchedulerDriver.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

:processJmhResources UP-TO-DATE
:jmhClasses
:checkstyleJmh
:jsHint
:checkstyleMain
:compileTestJava/home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/thrift/aop/MockDecoratedThrift.java:38: Note: Wrote forwarder org.apache.aurora.scheduler.thrift.aop.MockDecoratedThriftForwarder
@Forward(AnnotatedAuroraAdmin.class)
^
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

:processTestResources
:testClasses
:checkstyleTest[ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java:544: File contains a sequence of empty lines. [RegexpMultiline]
[ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java:547:5: Redundant 'public' modifier. [RedundantModifier]
 FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':checkstyleTest'.
> Checkstyle rule violations were found. See the report at: file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/checkstyle/test.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 1 mins 40.897 secs


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Aug. 21, 2017, 5:38 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 21, 2017, 5:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/1/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183584
-----------------------------------------------------------


Ship it!




This patch looks good to me, thanks!You mentioned that you want to test it on a live cluster. Please let us know once you have done this so that we can merge it.

As a side node: Do you have any insight yet why your executor queue is so overloaded? Your log timestamps indicate a serious delay. I suppose this could cause problems in various other places as well.

- Stephan Erb


On Aug. 22, 2017, 11:54 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 11:54 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/5/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183541
-----------------------------------------------------------


Ship it!




Master (aae2b0d) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Aug. 22, 2017, 9:54 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 9:54 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/5/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183538
-----------------------------------------------------------


Ship it!




Ship It!

- Santhosh Kumar Shanmugham


On Aug. 22, 2017, 2:54 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 2:54 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/5/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183681
-----------------------------------------------------------


Ship it!




Master (aae2b0d) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Aug. 23, 2017, 3:38 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2017, 3:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/6/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/
-----------------------------------------------------------

(Updated Aug. 23, 2017, 10:38 p.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.


Changes
-------

Fix logging messages to not produce a newline.


Bugs: AURORA-1945
    https://issues.apache.org/jira/browse/AURORA-1945


Repository: aurora


Description
-------

The current race condition for offers is possible:
```
1. Scheduler receives an offer and adds it to the executor queue for processing.
2. The executor processes the offer and adds it to the HostOffers list.
3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
```
The following logs show this in action:

Mesos:
```
I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
```
Aurora:
```
I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
``` 
I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933


Diffs (updated)
-----

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
  src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
  src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 


Diff: https://reviews.apache.org/r/61804/diff/6/

Changes: https://reviews.apache.org/r/61804/diff/5-6/


Testing
-------

`./gradlew test`

Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.

I will verify this patch on a live cluster as well before submitting.


Thanks,

Jordan Ly


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/
-----------------------------------------------------------

(Updated Aug. 22, 2017, 9:54 p.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.


Changes
-------

Remove "unban" from OfferManager interface.


Bugs: AURORA-1945
    https://issues.apache.org/jira/browse/AURORA-1945


Repository: aurora


Description
-------

The current race condition for offers is possible:
```
1. Scheduler receives an offer and adds it to the executor queue for processing.
2. The executor processes the offer and adds it to the HostOffers list.
3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
```
The following logs show this in action:

Mesos:
```
I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
```
Aurora:
```
I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
``` 
I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933


Diffs (updated)
-----

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
  src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
  src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 


Diff: https://reviews.apache.org/r/61804/diff/5/

Changes: https://reviews.apache.org/r/61804/diff/4-5/


Testing
-------

`./gradlew test`

Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.

I will verify this patch on a live cluster as well before submitting.


Thanks,

Jordan Ly


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183526
-----------------------------------------------------------


Ship it!




Master (aae2b0d) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Aug. 22, 2017, 8:33 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 8:33 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/4/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/
-----------------------------------------------------------

(Updated Aug. 22, 2017, 8:33 p.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.


Changes
-------

Add additional gauge for explicitly tracking globally banned offer sizes


Bugs: AURORA-1945
    https://issues.apache.org/jira/browse/AURORA-1945


Repository: aurora


Description
-------

The current race condition for offers is possible:
```
1. Scheduler receives an offer and adds it to the executor queue for processing.
2. The executor processes the offer and adds it to the HostOffers list.
3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
```
The following logs show this in action:

Mesos:
```
I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
```
Aurora:
```
I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
``` 
I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933


Diffs (updated)
-----

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
  src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
  src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 


Diff: https://reviews.apache.org/r/61804/diff/4/

Changes: https://reviews.apache.org/r/61804/diff/3-4/


Testing
-------

`./gradlew test`

Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.

I will verify this patch on a live cluster as well before submitting.


Thanks,

Jordan Ly


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183514
-----------------------------------------------------------


Ship it!




Ship It!

- David McLaughlin


On Aug. 22, 2017, 5:05 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 5:05 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/3/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183490
-----------------------------------------------------------


Ship it!




Master (aae2b0d) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Aug. 22, 2017, 5:05 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2017, 5:05 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The current race condition for offers is possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
> ```
> The following logs show this in action:
> 
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ``` 
> I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 
> 
> 
> Diff: https://reviews.apache.org/r/61804/diff/3/
> 
> 
> Testing
> -------
> 
> `./gradlew test`
> 
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
> 
> I will verify this patch on a live cluster as well before submitting.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/
-----------------------------------------------------------

(Updated Aug. 22, 2017, 5:05 p.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.


Changes
-------

Fixed spelling/word choice nits, added metric for global bans.


Bugs: AURORA-1945
    https://issues.apache.org/jira/browse/AURORA-1945


Repository: aurora


Description
-------

The current race condition for offers is possible:
```
1. Scheduler receives an offer and adds it to the executor queue for processing.
2. The executor processes the offer and adds it to the HostOffers list.
3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
```
The following logs show this in action:

Mesos:
```
I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
```
Aurora:
```
I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
``` 
I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933


Diffs (updated)
-----

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
  src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
  src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 


Diff: https://reviews.apache.org/r/61804/diff/3/

Changes: https://reviews.apache.org/r/61804/diff/2-3/


Testing
-------

`./gradlew test`

Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.

I will verify this patch on a live cluster as well before submitting.


Thanks,

Jordan Ly


Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/
-----------------------------------------------------------

(Updated Aug. 22, 2017, 12:38 a.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.


Bugs: AURORA-1945
    https://issues.apache.org/jira/browse/AURORA-1945


Repository: aurora


Description (updated)
-------

The current race condition for offers is possible:
```
1. Scheduler receives an offer and adds it to the executor queue for processing.
2. The executor processes the offer and adds it to the HostOffers list.
3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, there is a lot of load on the executor so there might be a delay between receiving the rescind and processing it.
4. Scheduler accepts the offer before the rescind is processed by the executor. This will result in launching a task with an invalid offer leading to TASK_LOST.
```
The following logs show this in action:

Mesos:
```
I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
```
Aurora:
```
I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X 
I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH] 
I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid 
I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
``` 
I would like to temporarily ban offers if we receive a rescind but the offer has not yet been added (ie. still in the executor queue). Then, when we actually process the offer we will not assign it to tasks since we know it has been rescinded already. When we ban the offer, we will also add a command to unban the offer to the executor queue so that future offers will not be affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933


Diffs
-----

  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80 
  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5 
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b 
  src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e 
  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3 
  src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda 


Diff: https://reviews.apache.org/r/61804/diff/1/


Testing
-------

`./gradlew test`

Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.

I will verify this patch on a live cluster as well before submitting.


Thanks,

Jordan Ly