You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by David McLaughlin <da...@dmclaughlin.com> on 2017/11/07 19:28:58 UTC

Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs
-----

  api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 


Diff: https://reviews.apache.org/r/63536/diff/1/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Jordan Ly <jo...@gmail.com>.

> On Nov. 9, 2017, 1:10 a.m., Jordan Ly wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 402 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line402>
> >
> >     I wondering if having repeated `PARTITIONED` events would be helpful in debugging issues, and if it is worth having to deal with the possibility of an unbounded TaskEvent list.
> >     
> >     I don't really have a strong opinion yet, but I would be curious as to what others think.
> 
> Bill Farner wrote:
>     I agree to an extent, but the pathological case is acutely painful.  It could mean an irrecoverable cluster, and tracking down the reason would be challenging.
>     
>     > debugging
>     
>     Do you think preserving the count helps here?  Between the count as an indicator of a problem and logs to dig deeper, i can't see any loss in signal.

You are definitely correct! I was originially thinking we would want to preserve timestamps, but I forgot about the logs and having the count is a strong enough signal to go into the logs which has all the information we need.


- Jordan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190521
-----------------------------------------------------------


On Nov. 9, 2017, 12:48 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2017, 12:48 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Bill Farner <wf...@apache.org>.

> On Nov. 8, 2017, 5:10 p.m., Jordan Ly wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 402 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line402>
> >
> >     I wondering if having repeated `PARTITIONED` events would be helpful in debugging issues, and if it is worth having to deal with the possibility of an unbounded TaskEvent list.
> >     
> >     I don't really have a strong opinion yet, but I would be curious as to what others think.

I agree to an extent, but the pathological case is acutely painful.  It could mean an irrecoverable cluster, and tracking down the reason would be challenging.

> debugging

Do you think preserving the count helps here?  Between the count as an indicator of a problem and logs to dig deeper, i can't see any loss in signal.


- Bill


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190521
-----------------------------------------------------------


On Nov. 8, 2017, 4:48 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 8, 2017, 4:48 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 9, 2017, 1:10 a.m., Jordan Ly wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 57 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line57>
> >
> >     I don't think you need to `synchronize` here since it is a `ConcurrentHashMap`. You can probably just call `remove`, and if the value it returns is not `null` you can cancel it.
> >     
> >     This way, you can remove the double-checked locking.

Removed the map.


> On Nov. 9, 2017, 1:10 a.m., Jordan Ly wrote:
> > src/main/java/org/apache/aurora/scheduler/state/SideEffect.java
> > Lines 73 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885208#file1885208line73>
> >
> >     I would rename this to avoid confusion. Maybe make it a verb like `TRANSITION_TO_LOST`

Done.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190521
-----------------------------------------------------------


On Nov. 14, 2017, 10:41 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 14, 2017, 10:41 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/4/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190521
-----------------------------------------------------------



Overall the logic seems sound to me!


src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 57 (patched)
<https://reviews.apache.org/r/63536/#comment267968>

    I don't think you need to `synchronize` here since it is a `ConcurrentHashMap`. You can probably just call `remove`, and if the value it returns is not `null` you can cancel it.
    
    This way, you can remove the double-checked locking.



src/main/java/org/apache/aurora/scheduler/state/SideEffect.java
Lines 73 (patched)
<https://reviews.apache.org/r/63536/#comment267975>

    I would rename this to avoid confusion. Maybe make it a verb like `TRANSITION_TO_LOST`



src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
Lines 402 (patched)
<https://reviews.apache.org/r/63536/#comment267970>

    I wondering if having repeated `PARTITIONED` events would be helpful in debugging issues, and if it is worth having to deal with the possibility of an unbounded TaskEvent list.
    
    I don't really have a strong opinion yet, but I would be curious as to what others think.


- Jordan Ly


On Nov. 9, 2017, 12:48 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2017, 12:48 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190632
-----------------------------------------------------------



Thanks for all the feedback! I'll move forward towards a production quality solution now.

- David McLaughlin


On Nov. 9, 2017, 12:48 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2017, 12:48 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190758
-----------------------------------------------------------



Master (9b9b2ee) is red with this patch.
  ./build-support/jenkins/build.sh

    java.lang.IllegalArgumentException

org.apache.aurora.scheduler.storage.db.TaskStoreTest > testSaveWithMesosFetcherUris FAILED
    java.lang.AssertionError

org.apache.aurora.scheduler.storage.db.JobUpdateStoreTest > testMultipleJobDetails FAILED
    java.lang.IllegalArgumentException

org.apache.aurora.scheduler.storage.db.TaskStoreTest > testNullVsEmptyRelations FAILED
    com.google.common.util.concurrent.UncheckedExecutionException
        Caused by: java.lang.IllegalArgumentException

org.apache.aurora.scheduler.storage.db.TaskStoreTest > testQueryMultipleInstances FAILED
    java.lang.AssertionError

org.apache.aurora.scheduler.storage.db.TaskStoreTest > testAddSlaveHost FAILED
    java.lang.Exception
        Caused by: java.lang.AssertionError
I1111 00:05:32.878 [ShutdownHook, SchedulerMain] Stopping scheduler services. 
I1111 00:05:33.001 [SessionTracker, SessionTrackerImpl] SessionTrackerImpl exited loop! 

1195 tests completed, 54 failed, 2 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:jacocoTestCoverageVerification

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 14m 49s
44 actionable tasks: 35 executed, 9 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 10, 2017, 11:45 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 10, 2017, 11:45 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/3/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Jordan Ly <jo...@gmail.com>.

> On Nov. 21, 2017, 5:54 p.m., David McLaughlin wrote:
> > The outstanding feedback is:
> > 
> > * Discussion on using a mock with capture (my preference) vs using a fake implementation. 
> > * Concern over how to handle a partition during a transient state. My preference is to move immediately to LOST (existing behavior).
> > * Confusion over what Disable meant (now removed).
> > 
> > I would appreciate other thoughts on these, as I'm eager to close this work out and move on to other things.
> 
> Stephan Erb wrote:
>     https://github.com/apache/aurora/blob/master/docs/reference/task-lifecycle.md will also need an update. I am OK if you ignore the image as it is already outdated, but the text should reflect reality.

My thoughts:

* I think that the current tests are alright -- they test the behavior of PartitionManager in isolation and are fairly easy to read. I imagine that you mean 'integration test' by 'fake implementation', but please correct me if I am mistaken. On a side note: do we ever explicitly test that something is never called (i.e. `expect(...).times(0)`)? I know EasyMock's default behavior is that if something is called that isn't supposed to be called, it will fail. However, when reading tests, I feel like it the explicit declaration flows smoother in my mind.
* +1 for the existing behavior.
* +1 to removing Disable


- Jordan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191603
-----------------------------------------------------------


On Nov. 21, 2017, 6:42 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 6:42 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   docs/reference/configuration.md f647bcf86e84c3b08848e5f54f6a5ad95d55fc8a 
>   docs/reference/scheduler-configuration.md d17541f9458650240983276b69f749a854057aa8 
>   docs/reference/task-lifecycle.md 8ec0077f7c5caef9c012db02ffd40c704fc2b347 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/10/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Stephan Erb <se...@apache.org>.

> On Nov. 21, 2017, 6:54 p.m., David McLaughlin wrote:
> > The outstanding feedback is:
> > 
> > * Discussion on using a mock with capture (my preference) vs using a fake implementation. 
> > * Concern over how to handle a partition during a transient state. My preference is to move immediately to LOST (existing behavior).
> > * Confusion over what Disable meant (now removed).
> > 
> > I would appreciate other thoughts on these, as I'm eager to close this work out and move on to other things.

https://github.com/apache/aurora/blob/master/docs/reference/task-lifecycle.md will also need an update. I am OK if you ignore the image as it is already outdated, but the text should reflect reality.


- Stephan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191603
-----------------------------------------------------------


On Nov. 21, 2017, 6:50 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 6:50 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/8/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Bill Farner <wf...@apache.org>.

> On Nov. 21, 2017, 9:54 a.m., David McLaughlin wrote:
> > The outstanding feedback is:
> > 
> > * Discussion on using a mock with capture (my preference) vs using a fake implementation. 
> > * Concern over how to handle a partition during a transient state. My preference is to move immediately to LOST (existing behavior).
> > * Confusion over what Disable meant (now removed).
> > 
> > I would appreciate other thoughts on these, as I'm eager to close this work out and move on to other things.
> 
> Stephan Erb wrote:
>     https://github.com/apache/aurora/blob/master/docs/reference/task-lifecycle.md will also need an update. I am OK if you ignore the image as it is already outdated, but the text should reflect reality.
> 
> Jordan Ly wrote:
>     My thoughts:
>     
>     * I think that the current tests are alright -- they test the behavior of PartitionManager in isolation and are fairly easy to read. I imagine that you mean 'integration test' by 'fake implementation', but please correct me if I am mistaken. On a side note: do we ever explicitly test that something is never called (i.e. `expect(...).times(0)`)? I know EasyMock's default behavior is that if something is called that isn't supposed to be called, it will fail. However, when reading tests, I feel like it the explicit declaration flows smoother in my mind.
>     * +1 for the existing behavior.
>     * +1 to removing Disable

I don't feel strongly about any of the outstanding issues.


- Bill


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191603
-----------------------------------------------------------


On Nov. 21, 2017, 10:42 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 10:42 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   docs/reference/configuration.md f647bcf86e84c3b08848e5f54f6a5ad95d55fc8a 
>   docs/reference/scheduler-configuration.md d17541f9458650240983276b69f749a854057aa8 
>   docs/reference/task-lifecycle.md 8ec0077f7c5caef9c012db02ffd40c704fc2b347 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/10/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 21, 2017, 5:54 p.m., David McLaughlin wrote:
> > The outstanding feedback is:
> > 
> > * Discussion on using a mock with capture (my preference) vs using a fake implementation. 
> > * Concern over how to handle a partition during a transient state. My preference is to move immediately to LOST (existing behavior).
> > * Confusion over what Disable meant (now removed).
> > 
> > I would appreciate other thoughts on these, as I'm eager to close this work out and move on to other things.
> 
> Stephan Erb wrote:
>     https://github.com/apache/aurora/blob/master/docs/reference/task-lifecycle.md will also need an update. I am OK if you ignore the image as it is already outdated, but the text should reflect reality.
> 
> Jordan Ly wrote:
>     My thoughts:
>     
>     * I think that the current tests are alright -- they test the behavior of PartitionManager in isolation and are fairly easy to read. I imagine that you mean 'integration test' by 'fake implementation', but please correct me if I am mistaken. On a side note: do we ever explicitly test that something is never called (i.e. `expect(...).times(0)`)? I know EasyMock's default behavior is that if something is called that isn't supposed to be called, it will fail. However, when reading tests, I feel like it the explicit declaration flows smoother in my mind.
>     * +1 for the existing behavior.
>     * +1 to removing Disable
> 
> Bill Farner wrote:
>     I don't feel strongly about any of the outstanding issues.

Okay, great. In that case I will go ahead and ship this.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191603
-----------------------------------------------------------


On Nov. 21, 2017, 6:42 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 6:42 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   docs/reference/configuration.md f647bcf86e84c3b08848e5f54f6a5ad95d55fc8a 
>   docs/reference/scheduler-configuration.md d17541f9458650240983276b69f749a854057aa8 
>   docs/reference/task-lifecycle.md 8ec0077f7c5caef9c012db02ffd40c704fc2b347 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/10/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191603
-----------------------------------------------------------



The outstanding feedback is:

* Discussion on using a mock with capture (my preference) vs using a fake implementation. 
* Concern over how to handle a partition during a transient state. My preference is to move immediately to LOST (existing behavior).
* Confusion over what Disable meant (now removed).

I would appreciate other thoughts on these, as I'm eager to close this work out and move on to other things.

- David McLaughlin


On Nov. 21, 2017, 5:50 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 5:50 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/8/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191608
-----------------------------------------------------------



Master (46b1112) is red with this patch.
  ./build-support/jenkins/build.sh

  Running setup.py bdist_wheel for twitter.common.log: finished with status 'done'
  Stored in directory: /home/jenkins/jenkins-slave/workspace/AuroraBot/.home/.cache/pip/wheels/80/e3/3a/297d8950fcbd822ab5a6a62175818fab38b668cc5a86dbd0b0
  Running setup.py bdist_wheel for pycparser: started
  Running setup.py bdist_wheel for pycparser: finished with status 'done'
  Stored in directory: /home/jenkins/jenkins-slave/workspace/AuroraBot/.home/.cache/pip/wheels/95/14/9a/5e7b9024459d2a6600aaa64e0ba485325aff7a9ac7489db1b6
  Running setup.py bdist_wheel for twitter.common.options: started
  Running setup.py bdist_wheel for twitter.common.options: finished with status 'done'
  Stored in directory: /home/jenkins/jenkins-slave/workspace/AuroraBot/.home/.cache/pip/wheels/17/41/80/c4811d8c1c7ca7007e520c3399872fc340f45c3a26a6a23e6a
Successfully built pantsbuild.pants twitter.common.collections setproctitle ansicolors pathspec scandir twitter.common.dirutil psutil pystache docutils Markdown Pygments twitter.common.confluence coverage lmdb pywatchman twitter.common.lang twitter.common.log pycparser twitter.common.options
Installing collected packages: twitter.common.lang, twitter.common.collections, setproctitle, setuptools, six, ansicolors, pyparsing, packaging, pathspec, scandir, twitter.common.dirutil, psutil, requests, pystache, pex, docutils, Markdown, Pygments, twitter.common.options, twitter.common.log, twitter.common.confluence, monotonic, fasteners, coverage, pycparser, cffi, lmdb, pywatchman, futures, pantsbuild.pants
  Found existing installation: setuptools 21.2.1
    Uninstalling setuptools-21.2.1:
      Successfully uninstalled setuptools-21.2.1
Successfully installed Markdown-2.1.1 Pygments-1.4 ansicolors-1.0.2 cffi-1.7.0 coverage-3.7.1 docutils-0.12 fasteners-0.14.1 futures-3.0.5 lmdb-0.89 monotonic-1.4 packaging-16.7 pantsbuild.pants-1.3.0.dev3 pathspec-0.3.4 pex-1.1.16 psutil-4.3.0 pycparser-2.18 pyparsing-2.2.0 pystache-0.5.3 pywatchman-1.3.0 requests-2.5.3 scandir-1.2 setproctitle-1.1.10 setuptools-30.0.0 six-1.11.0 twitter.common.collections-0.3.9 twitter.common.confluence-0.3.9 twitter.common.dirutil-0.3.9 twitter.common.lang-0.3.9 twitter.common.log-0.3.9 twitter.common.options-0.3.9
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

18:00:44 00:00 [main]
               (To run a reporting server: ./pants server)
18:00:44 00:00   [setup]
18:00:44 00:00     [parse]
               Executing tasks in goals: compile
18:00:44 00:00   [compile]
18:00:44 00:00     [compile-prep-command]
18:00:44 00:00       [prep_command]
18:00:48 00:04     [compile]
18:00:48 00:04     [python-eval]
18:00:48 00:04     [pythonstyle]
18:00:49 00:05       [cache]                                          
                   No cached artifacts for 42 targets.
                   Invalidated 42 targets.
F401:ERROR   src/main/python/apache/aurora/client/cli/jobs.py:066 'PystachioPartitionPolicy' imported but unused
     |from apache.aurora.config.schema.base import PartitionPolicy as PystachioPartitionPolicy


FAILURE: 1 Python Style issues found. For import order related issues, please try `./pants fmt.isort <targets>`


18:01:05 00:21   [complete]
               FAILURE


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 21, 2017, 5:50 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 5:50 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/8/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191616
-----------------------------------------------------------


Ship it!




Master (46b1112) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 21, 2017, 6:18 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 6:18 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/9/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191626
-----------------------------------------------------------


Ship it!




Master (46b1112) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 21, 2017, 6:42 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 6:42 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   docs/reference/configuration.md f647bcf86e84c3b08848e5f54f6a5ad95d55fc8a 
>   docs/reference/scheduler-configuration.md d17541f9458650240983276b69f749a854057aa8 
>   docs/reference/task-lifecycle.md 8ec0077f7c5caef9c012db02ffd40c704fc2b347 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/10/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191666
-----------------------------------------------------------


Ship it!




LGTM. Great new feature!


RELEASE-NOTES.md
Lines 12-13 (patched)
<https://reviews.apache.org/r/63536/#comment269507>

    Should we prevent that users submit a non-default reschedule policy if the scheduler feature toggle is disabled? 
    (I am fine either way but just want to point it out)


- Stephan Erb


On Nov. 21, 2017, 7:42 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 7:42 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   docs/reference/configuration.md f647bcf86e84c3b08848e5f54f6a5ad95d55fc8a 
>   docs/reference/scheduler-configuration.md d17541f9458650240983276b69f749a854057aa8 
>   docs/reference/task-lifecycle.md 8ec0077f7c5caef9c012db02ffd40c704fc2b347 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/10/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 21, 2017, 6:42 p.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

Update docs.


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
  api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
  docs/reference/configuration.md f647bcf86e84c3b08848e5f54f6a5ad95d55fc8a 
  docs/reference/scheduler-configuration.md d17541f9458650240983276b69f749a854057aa8 
  docs/reference/task-lifecycle.md 8ec0077f7c5caef9c012db02ffd40c704fc2b347 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
  src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
  src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
  src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
  src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
  src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
  src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
  src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
  src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 


Diff: https://reviews.apache.org/r/63536/diff/10/

Changes: https://reviews.apache.org/r/63536/diff/9-10/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 21, 2017, 6:18 p.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

Green build.


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
  api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
  src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
  src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
  src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
  src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
  src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
  src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
  src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
  src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 


Diff: https://reviews.apache.org/r/63536/diff/9/

Changes: https://reviews.apache.org/r/63536/diff/8-9/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 21, 2017, 5:50 p.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

Feedback, and remove Disable and just revert to explicit PartitionPolicy in pystachio.


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
  api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
  src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
  src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
  src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
  src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
  src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
  src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
  src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
  src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 


Diff: https://reviews.apache.org/r/63536/diff/8/

Changes: https://reviews.apache.org/r/63536/diff/7-8/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191261
-----------------------------------------------------------


Ship it!




Master (46b1112) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 16, 2017, 10:30 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 16, 2017, 10:30 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/7/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 16, 2017, 10:30 p.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

Add comments to explain the rationale behind why we ignore the PartitionPolicy and move straight to LOST when tasks are being killed/preempted/restarted/drained.


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
  api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
  src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
  src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
  src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
  src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
  src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
  src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
  src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
  src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 


Diff: https://reviews.apache.org/r/63536/diff/7/

Changes: https://reviews.apache.org/r/63536/diff/6-7/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191144
-----------------------------------------------------------


Ship it!




Master (46b1112) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 16, 2017, 9:54 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 16, 2017, 9:54 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/6/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 16, 2017, 10:22 p.m., Jordan Ly wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 90-91 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893876#file1893876line90>
> >
> >     Can this just be:
> >     ```
> >     if (stateChange.getNewState().equals(ScheduleStatus.PARTITIONED))
> >     ```

partitionPolicy is optional. I'll explain this check with a comment.


> On Nov. 16, 2017, 10:22 p.m., Jordan Ly wrote:
> > src/main/python/apache/aurora/config/schema/base.py
> > Lines 157 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893882#file1893882line157>
> >
> >     Can you explain what `Disable` does? If it is `RescheduleImmediately`, then the default 0 on `PartitionPolicy` is sufficient.

Hmm, Disable might not be a good name for this actually. It's more like NeverReschedule. But that is also not a good name because we would give up on the partition in certain conditions (user or operator initiated actions).


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191256
-----------------------------------------------------------


On Nov. 16, 2017, 10:30 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 16, 2017, 10:30 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/7/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191256
-----------------------------------------------------------


Fix it, then Ship it!




Overall LGTM! You may want to add some entry to the release notes explaining how Mesos should fix some bugs on their end before enabling this.


src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 90-91 (patched)
<https://reviews.apache.org/r/63536/#comment268977>

    Can this just be:
    ```
    if (stateChange.getNewState().equals(ScheduleStatus.PARTITIONED))
    ```



src/main/python/apache/aurora/config/schema/base.py
Lines 157 (patched)
<https://reviews.apache.org/r/63536/#comment268978>

    Can you explain what `Disable` does? If it is `RescheduleImmediately`, then the default 0 on `PartitionPolicy` is sufficient.


- Jordan Ly


On Nov. 16, 2017, 1:54 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 16, 2017, 1:54 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/6/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
> > Lines 185 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893875#file1893875line185>
> >
> >     Nit: `isPartitionAware` to minutely improve readability.

Done.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 91 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893876#file1893876line91>
> >
> >     A `LOG.info` on either side of the branch on task configuration would be nice to simplify production debugging.
> >     
> >     `Partitioned task will be rescheduled in %s seconds`
> >     
> >     `Partitioned task will not be rescheduled`

Done.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
> > Line 266 (original), 307 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893880#file1893880line307>
> >
> >     Nit: order the states consistently where possible.  Makes it easier to scan for where states are handled differently.  The prevailing convention in this patch seems to be `PARTITIONED` last, which sounds fine to me.

Done.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 57 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line57>
> >
> >     nit: rm line

Done.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 60 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line60>
> >
> >     nit: static; ditto elsewhere

Done.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 74-75 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line74>
> >
> >     can you swap old/new param order?  i was confused by some callers until i came up here :-)

Done.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java
> > Lines 431 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893888#file1893888line431>
> >
> >     Please phrase this check as:
> >     
> >     ```java
> >     assertEquals(
> >       ImmutableList.of(PENDING, ASSIGNED, ...),
> >       updatedTask.getTaskEvents().stream()
> >           .map(e -> e.getStatus())
> >           .collect(Collectors.toList()));
> >     ```
> >     
> >     This will give greater confidence that the compaction worked as intended.

Much better, thanks. Done.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191229
-----------------------------------------------------------


On Nov. 21, 2017, 5:50 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 21, 2017, 5:50 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/8/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
> > Lines 209 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893880#file1893880line209>
> >
> >     I suggest keeping this as a valid transition and allowing the PartitionManager to deal with it.

See the comment I added to this transition. The TL;DR is it could indefinitely block restarts, maintenance and preemption when reschedule=False. I don't think this is desirable. Today any such indefinite delays are handled by transient task timeouts, but obviously we can't apply those timeouts to PARTITIONED. 

My conclusion after much deliberation (I mention some of this about duplicate instances in the design doc) was that it's fine just to move to LOST (which is the behavior today). The reasoning I ended up with is that PartitionPolicy is intended for keeping tasks *running*, and that partitions encountered when tearing tasks down are far less important (or not important at all). 

At the same time, it's a discussion worth fleshing out. The other alternative I had considered was to simply ignore PARTITIONED when restarting and wait for the transient task timeout to hit. But because we'd only want to do that on reschedule=False, this logic would have to live in PartitionManager. But this would not solve the problem of actually waiting indefinitely when reschedule=False.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
> > Lines 281 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893880#file1893880line281>
> >
> >     If `FAILED` is in this list, i would expect `FINISHED` as well.

Not sure how I missed that. Thanks!


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/main/python/apache/aurora/config/schema/base.py
> > Lines 157 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893882#file1893882line157>
> >
> >     I would prefer the name `RescheduleImmediately` (or similar) to make this self-documenting.

See my reply to Jordan. Disable does not mean this, it's an alias for reschedule=False. I lifted the name Disabled straight from the comment in the proposal doc, but looks like it's added confusion. I'm thinking of just reverting that suggestion and going back to explicit PartitionPolicy.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 55 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line55>
> >
> >     (disclaimer: see my longer comment about an integration test before trying to satisfy other comments in this file)
> >     
> >     Have a look at `FakeScheduledExecutor` and see if you find it helpful.  The API is a bit weird, but it aims to simplify this type of test where you need a `Clock` and a `ScheduledExecutor`.  At the very least, it eliminates the noise of using `Capture`s.
> 
> David McLaughlin wrote:
>     I actually found the capture much easier to read and reason about than the FakeScheduledExecutor, and I thought it led to some pretty clean tests. Curious how others feel.

I had seen the FakeScheduledExecutor and opted not to use it. I actually find the capture much easier to read and reason about. Curious how others feel.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 55 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line55>
> >
> >     (disclaimer: see my longer comment about an integration test before trying to satisfy other comments in this file)
> >     
> >     Have a look at `FakeScheduledExecutor` and see if you find it helpful.  The API is a bit weird, but it aims to simplify this type of test where you need a `Clock` and a `ScheduledExecutor`.  At the very least, it eliminates the noise of using `Capture`s.
> 
> David McLaughlin wrote:
>     I had seen the FakeScheduledExecutor and opted not to use it. I actually find the capture much easier to read and reason about. Curious how others feel.

I actually found the capture much easier to read and reason about than the FakeScheduledExecutor, and I thought it led to some pretty clean tests. Curious how others feel.


> On Nov. 16, 2017, 10:01 p.m., Bill Farner wrote:
> > src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
> > Lines 109-115 (patched)
> > <https://reviews.apache.org/r/63536/diff/6/?file=1893887#file1893887line109>
> >
> >     Test cases like this lead me to believe that an integration test is more appropriate.  This test case reads as "nothing happens when a task with reschedule=false transitions from RUNNING to PARTITIONED".  What we really mean, though, is that "a task with reschedule=false immediately moves to LOST upon transition attempt from RUNNING to PARTITIONED".
> >     
> >     With this in mind, i suggest you eliminate this test and move the coverage into `StateManagerImplTest`.  I think this is rather natural, since they work in unison to manage state transitions.
> >     
> >     I don't feel strongly about this, since it is not the convention in the codebase, and i don't know what hurdles you may encounter if you try to bundle this behavior in `StateManagerImplTest`.

> What we really mean, though, is that "a task with reschedule=false immediately moves to LOST upon transition attempt from RUNNING to PARTITIONED".

reschedule=False means that a transition to LOST is never called, which I think this test captures well. 

I thought the fact that I ended up having to test the TaskStateMachine in StateManagerImpl was a huge code smell :/


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191229
-----------------------------------------------------------


On Nov. 16, 2017, 10:30 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 16, 2017, 10:30 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d653b797533735fc936a1cff2e005813757e1290 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/7/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Bill Farner <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191229
-----------------------------------------------------------


Fix it, then Ship it!




Looks great overall!  The e2e test is very easy to follow.


src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
Lines 185 (patched)
<https://reviews.apache.org/r/63536/#comment268905>

    Nit: `isPartitionAware` to minutely improve readability.



src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 91 (patched)
<https://reviews.apache.org/r/63536/#comment268912>

    A `LOG.info` on either side of the branch on task configuration would be nice to simplify production debugging.
    
    `Partitioned task will be rescheduled in %s seconds`
    
    `Partitioned task will not be rescheduled`



src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
Lines 209 (patched)
<https://reviews.apache.org/r/63536/#comment268953>

    I suggest keeping this as a valid transition and allowing the PartitionManager to deal with it.



src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
Lines 281 (patched)
<https://reviews.apache.org/r/63536/#comment268956>

    If `FAILED` is in this list, i would expect `FINISHED` as well.



src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java
Line 266 (original), 307 (patched)
<https://reviews.apache.org/r/63536/#comment268955>

    Nit: order the states consistently where possible.  Makes it easier to scan for where states are handled differently.  The prevailing convention in this patch seems to be `PARTITIONED` last, which sounds fine to me.



src/main/python/apache/aurora/config/schema/base.py
Lines 157 (patched)
<https://reviews.apache.org/r/63536/#comment268963>

    I would prefer the name `RescheduleImmediately` (or similar) to make this self-documenting.



src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
Lines 55 (patched)
<https://reviews.apache.org/r/63536/#comment268969>

    (disclaimer: see my longer comment about an integration test before trying to satisfy other comments in this file)
    
    Have a look at `FakeScheduledExecutor` and see if you find it helpful.  The API is a bit weird, but it aims to simplify this type of test where you need a `Clock` and a `ScheduledExecutor`.  At the very least, it eliminates the noise of using `Capture`s.



src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
Lines 57 (patched)
<https://reviews.apache.org/r/63536/#comment268966>

    nit: rm line



src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
Lines 60 (patched)
<https://reviews.apache.org/r/63536/#comment268967>

    nit: static; ditto elsewhere



src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
Lines 74-75 (patched)
<https://reviews.apache.org/r/63536/#comment268971>

    can you swap old/new param order?  i was confused by some callers until i came up here :-)



src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
Lines 87 (patched)
<https://reviews.apache.org/r/63536/#comment268968>

    nit: newline before



src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java
Lines 109-115 (patched)
<https://reviews.apache.org/r/63536/#comment268973>

    Test cases like this lead me to believe that an integration test is more appropriate.  This test case reads as "nothing happens when a task with reschedule=false transitions from RUNNING to PARTITIONED".  What we really mean, though, is that "a task with reschedule=false immediately moves to LOST upon transition attempt from RUNNING to PARTITIONED".
    
    With this in mind, i suggest you eliminate this test and move the coverage into `StateManagerImplTest`.  I think this is rather natural, since they work in unison to manage state transitions.
    
    I don't feel strongly about this, since it is not the convention in the codebase, and i don't know what hurdles you may encounter if you try to bundle this behavior in `StateManagerImplTest`.



src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java
Lines 431 (patched)
<https://reviews.apache.org/r/63536/#comment268975>

    Please phrase this check as:
    
    ```java
    assertEquals(
      ImmutableList.of(PENDING, ASSIGNED, ...),
      updatedTask.getTaskEvents().stream()
          .map(e -> e.getStatus())
          .collect(Collectors.toList()));
    ```
    
    This will give greater confidence that the compaction worked as intended.


- Bill Farner


On Nov. 15, 2017, 5:54 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 15, 2017, 5:54 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
>   src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/6/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 16, 2017, 1:54 a.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

Add e2e test.


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
  src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
  src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
  src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
  src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
  src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
  src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
  src/test/sh/org/apache/aurora/e2e/partition_aware.aurora PRE-CREATION 
  src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh f0819fb7ac758ad1229a76fd9794b393400e9f63 


Diff: https://reviews.apache.org/r/63536/diff/6/

Changes: https://reviews.apache.org/r/63536/diff/5-6/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191021
-----------------------------------------------------------


Ship it!




Master (4fecf1f) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 15, 2017, 12:05 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 15, 2017, 12:05 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/5/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191025
-----------------------------------------------------------



I think this change is sufficiently complex that it warrants an e2e test. Will add that next.

- David McLaughlin


On Nov. 15, 2017, 12:05 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 15, 2017, 12:05 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
>   src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/5/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 15, 2017, 12:05 a.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

Support:

    Job(
     ...
      partition_policy=PartitionPolicy(delay_secs=30)
      ...
    )
    
 and 
 
    Job(
     ...
      partition_policy=Disable()
      ...
    )


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
  src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
  src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
  src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
  src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
  src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
  src/test/python/apache/aurora/config/test_thrift.py 7a1567a9b67917072bb0ba3eea5857e968375f4d 


Diff: https://reviews.apache.org/r/63536/diff/5/

Changes: https://reviews.apache.org/r/63536/diff/4-5/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review191013
-----------------------------------------------------------


Ship it!




Master (4fecf1f) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 15, 2017, 7:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 15, 2017, 7:41 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/4/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 14, 2017, 10:41 p.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

Rebase + green build. Server side is code-complete, but I'll follow up with a change to the Python DSL to match the recommendation in the proposal document.


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
  src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
  src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
  src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
  src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
  src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 


Diff: https://reviews.apache.org/r/63536/diff/4/

Changes: https://reviews.apache.org/r/63536/diff/3-4/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 10, 2017, 11:45 p.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

I've added most of the Java tests. Build is broken because the logic to backport this to the MyBatis stores is missing. I don't think that is worth the effort, and will probably wait until we get rid of the db stores (which I think is happening as soon as we put 0.19 out?).


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
  src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
  src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
  src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
  src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
  src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
  src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 


Diff: https://reviews.apache.org/r/63536/diff/3/

Changes: https://reviews.apache.org/r/63536/diff/2-3/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190519
-----------------------------------------------------------



Master (773d2d6) is red with this patch.
  ./build-support/jenkins/build.sh

                                                               ^
  required: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,boolean,Optional<String>
  found: String,String,Optional<Object>,Amount<Long,Time>,boolean,boolean,Optional<Object>
  reason: actual and formal argument lists differ in length
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java:109: error: method buildFrameworkInfo in class CommandLineDriverSettingsModule cannot be applied to given types;
    Protos.FrameworkInfo info = CommandLineDriverSettingsModule.buildFrameworkInfo(
                                                               ^
  required: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,boolean,Optional<String>
  found: String,String,Optional<Object>,Amount<Long,Time>,boolean,boolean,Optional<Object>
  reason: actual and formal argument lists differ in length
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java:125: error: method buildFrameworkInfo in class CommandLineDriverSettingsModule cannot be applied to given types;
    Protos.FrameworkInfo info = CommandLineDriverSettingsModule.buildFrameworkInfo(
                                                               ^
  required: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,boolean,Optional<String>
  found: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,Optional<Object>
  reason: actual and formal argument lists differ in length
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java:140: error: method buildFrameworkInfo in class CommandLineDriverSettingsModule cannot be applied to given types;
    Protos.FrameworkInfo info = CommandLineDriverSettingsModule.buildFrameworkInfo(
                                                               ^
  required: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,boolean,Optional<String>
  found: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,Optional<String>
  reason: actual and formal argument lists differ in length
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
5 errors
 FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':compileTestJava'.
> Compilation failed; see the compiler error output for details.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 2m 47s
27 actionable tasks: 21 executed, 6 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 9, 2017, 12:48 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2017, 12:48 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Bill Farner <wf...@apache.org>.

> On Nov. 9, 2017, 9:23 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 76 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line76>
> >
> >     Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping you perform a CAS.  If we had a _true_ CAS operation, you could very safely avoid the need to cancel futures.  You can't do that with the status only since the task can now cycle with `PARTITIONED`.
> >     
> >     What do you think about using the last transition timestamp to do this?
> >     
> >     ```java
> >     long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp();
> >     executor.schedule(() -> {
> >       storage.write(... {
> >         if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) {
> >           // proceed
> >         } else {
> >           // cas failed, the task has since transitioned
> >         }
> >       }
> >     
> >     }, delay, TimeUnit.SECONDS);
> >     ```
> 
> David McLaughlin wrote:
>     The other motivation (other than correctness, which I'm pretty sure we now have) for removing and cancelling futures was to get the tasks off the heap. Of course, that's not actually happening here because I'm not using a ScheduledThreadPoolExecutor (and setting removeOnCancelPolicy to true). With the AsyncUtil suggestion above, I'll be sure to set that to make it clear what I'm doing. 
>     
>     If we still think it's necessary to include this timestamp check, I'm happy to do that (although we'd have to refetch the task from storage inside the storage.write of course).

> get the tasks off the heap

Smells like premature optimization.  Either, you can certainly factor this so you only maintain a reference to the task ID and the timestamp.


- Bill


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190607
-----------------------------------------------------------


On Nov. 8, 2017, 4:48 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 8, 2017, 4:48 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 76 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line76>
> >
> >     Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping you perform a CAS.  If we had a _true_ CAS operation, you could very safely avoid the need to cancel futures.  You can't do that with the status only since the task can now cycle with `PARTITIONED`.
> >     
> >     What do you think about using the last transition timestamp to do this?
> >     
> >     ```java
> >     long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp();
> >     executor.schedule(() -> {
> >       storage.write(... {
> >         if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) {
> >           // proceed
> >         } else {
> >           // cas failed, the task has since transitioned
> >         }
> >       }
> >     
> >     }, delay, TimeUnit.SECONDS);
> >     ```
> 
> David McLaughlin wrote:
>     The other motivation (other than correctness, which I'm pretty sure we now have) for removing and cancelling futures was to get the tasks off the heap. Of course, that's not actually happening here because I'm not using a ScheduledThreadPoolExecutor (and setting removeOnCancelPolicy to true). With the AsyncUtil suggestion above, I'll be sure to set that to make it clear what I'm doing. 
>     
>     If we still think it's necessary to include this timestamp check, I'm happy to do that (although we'd have to refetch the task from storage inside the storage.write of course).
> 
> Bill Farner wrote:
>     > get the tasks off the heap
>     
>     Smells like premature optimization.  Either, you can certainly factor this so you only maintain a reference to the task ID and the timestamp.

Isn't premature optimization usually brandished in response to complexity though? I don't see the complexity here.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190607
-----------------------------------------------------------


On Nov. 9, 2017, 12:48 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2017, 12:48 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > api/src/main/thrift/org/apache/aurora/gen/api.thrift
> > Lines 526 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885203#file1885203line526>
> >
> >     It seems logical that this accounting would live alongside `ScheduledTask.failureCount`.
> >     
> >     How would you feel about naming this `timesPartitioned` instead?  I feel that more clearly disambiguates from overloaded meanings of `partition`, e.g. https://en.wikipedia.org/wiki/Partition_(database)

Thanks for this suggestion, I agree it's much clearer.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/base/Conversions.java
> > Lines 71-72 (original), 71-72 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885205#file1885205line71>
> >
> >     Comment is now stale.

Removed.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
> > Lines 100 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885206#file1885206line100>
> >
> >     How about `-partition_aware` instead?  I find the `enable` qualifier unnecessary.

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 31 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line31>
> >
> >     How about s/futureMap/delayedTransitions/

Map has been removed.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 40 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line40>
> >
> >     Please use `AsyncUtil.singleThreadLoggingScheduledExecutor()` instead.  In addition to automatically logging and tracking exceptions, it will create a daemon thread.

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 48 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line48>
> >
> >     Consider using `Tasks.getLatestEvent(task).getTimestamp()`
> >     
> >     to hide the need for `Iterables.getLast()`

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 53-54 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line53>
> >
> >     `String taskId = Tasks.id(stateChange.getTask());`

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 76 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line76>
> >
> >     Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping you perform a CAS.  If we had a _true_ CAS operation, you could very safely avoid the need to cancel futures.  You can't do that with the status only since the task can now cycle with `PARTITIONED`.
> >     
> >     What do you think about using the last transition timestamp to do this?
> >     
> >     ```java
> >     long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp();
> >     executor.schedule(() -> {
> >       storage.write(... {
> >         if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) {
> >           // proceed
> >         } else {
> >           // cas failed, the task has since transitioned
> >         }
> >       }
> >     
> >     }, delay, TimeUnit.SECONDS);
> >     ```
> 
> David McLaughlin wrote:
>     The other motivation (other than correctness, which I'm pretty sure we now have) for removing and cancelling futures was to get the tasks off the heap. Of course, that's not actually happening here because I'm not using a ScheduledThreadPoolExecutor (and setting removeOnCancelPolicy to true). With the AsyncUtil suggestion above, I'll be sure to set that to make it clear what I'm doing. 
>     
>     If we still think it's necessary to include this timestamp check, I'm happy to do that (although we'd have to refetch the task from storage inside the storage.write of course).
> 
> Bill Farner wrote:
>     > get the tasks off the heap
>     
>     Smells like premature optimization.  Either, you can certainly factor this so you only maintain a reference to the task ID and the timestamp.
> 
> David McLaughlin wrote:
>     Isn't premature optimization usually brandished in response to complexity though? I don't see the complexity here.
> 
> Bill Farner wrote:
>     I'm simply pointing out that we shouldn't maintain the map if the primary motivation is to reduce heap consumption.
>     
>     I do think the code would be easier to reason about if future cancellation and maintaining the additional map were not necessary.  I don't feel strongly about this, however.

I took this suggestion and removed the map.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 294 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line294>
> >
> >     `==` instead of `.equals()`

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 297 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line297>
> >
> >     This is unnecessary.

Removed.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 390 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line390>
> >
> >     s/The goal here is to c/C/

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 400 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line400>
> >
> >     It's worth mentioning that this is done to bound the number of events stored for a task.

Added.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateModule.java
> > Lines 70 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885210#file1885210line70>
> >
> >     PubsubEventModule.bindSubscriber(binder, PartitionManager.class);

Done.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190607
-----------------------------------------------------------


On Nov. 14, 2017, 10:41 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 14, 2017, 10:41 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9 
>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java 7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a 
>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca 
>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/4/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Bill Farner <wf...@apache.org>.

> On Nov. 9, 2017, 9:23 a.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 76 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line76>
> >
> >     Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping you perform a CAS.  If we had a _true_ CAS operation, you could very safely avoid the need to cancel futures.  You can't do that with the status only since the task can now cycle with `PARTITIONED`.
> >     
> >     What do you think about using the last transition timestamp to do this?
> >     
> >     ```java
> >     long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp();
> >     executor.schedule(() -> {
> >       storage.write(... {
> >         if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) {
> >           // proceed
> >         } else {
> >           // cas failed, the task has since transitioned
> >         }
> >       }
> >     
> >     }, delay, TimeUnit.SECONDS);
> >     ```
> 
> David McLaughlin wrote:
>     The other motivation (other than correctness, which I'm pretty sure we now have) for removing and cancelling futures was to get the tasks off the heap. Of course, that's not actually happening here because I'm not using a ScheduledThreadPoolExecutor (and setting removeOnCancelPolicy to true). With the AsyncUtil suggestion above, I'll be sure to set that to make it clear what I'm doing. 
>     
>     If we still think it's necessary to include this timestamp check, I'm happy to do that (although we'd have to refetch the task from storage inside the storage.write of course).
> 
> Bill Farner wrote:
>     > get the tasks off the heap
>     
>     Smells like premature optimization.  Either, you can certainly factor this so you only maintain a reference to the task ID and the timestamp.
> 
> David McLaughlin wrote:
>     Isn't premature optimization usually brandished in response to complexity though? I don't see the complexity here.

I'm simply pointing out that we shouldn't maintain the map if the primary motivation is to reduce heap consumption.

I do think the code would be easier to reason about if future cancellation and maintaining the additional map were not necessary.  I don't feel strongly about this, however.


- Bill


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190607
-----------------------------------------------------------


On Nov. 8, 2017, 4:48 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 8, 2017, 4:48 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 76 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line76>
> >
> >     Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping you perform a CAS.  If we had a _true_ CAS operation, you could very safely avoid the need to cancel futures.  You can't do that with the status only since the task can now cycle with `PARTITIONED`.
> >     
> >     What do you think about using the last transition timestamp to do this?
> >     
> >     ```java
> >     long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp();
> >     executor.schedule(() -> {
> >       storage.write(... {
> >         if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) {
> >           // proceed
> >         } else {
> >           // cas failed, the task has since transitioned
> >         }
> >       }
> >     
> >     }, delay, TimeUnit.SECONDS);
> >     ```

The other motivation (other than correctness, which I'm pretty sure we now have) for removing and cancelling futures was to get the tasks off the heap. Of course, that's not actually happening here because I'm not using a ScheduledThreadPoolExecutor (and setting removeOnCancelPolicy to true). With the AsyncUtil suggestion above, I'll be sure to set that to make it clear what I'm doing. 

If we still think it's necessary to include this timestamp check, I'm happy to do that (although we'd have to refetch the task from storage inside the storage.write of course).


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190607
-----------------------------------------------------------


On Nov. 9, 2017, 12:48 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2017, 12:48 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Bill Farner <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190607
-----------------------------------------------------------



Overall approach appears sound.  LGTM to proceed to tests.


api/src/main/thrift/org/apache/aurora/gen/api.thrift
Lines 526 (patched)
<https://reviews.apache.org/r/63536/#comment268091>

    It seems logical that this accounting would live alongside `ScheduledTask.failureCount`.
    
    How would you feel about naming this `timesPartitioned` instead?  I feel that more clearly disambiguates from overloaded meanings of `partition`, e.g. https://en.wikipedia.org/wiki/Partition_(database)



src/main/java/org/apache/aurora/scheduler/base/Conversions.java
Lines 71-72 (original), 71-72 (patched)
<https://reviews.apache.org/r/63536/#comment268092>

    Comment is now stale.



src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
Lines 100 (patched)
<https://reviews.apache.org/r/63536/#comment268093>

    How about `-partition_aware` instead?  I find the `enable` qualifier unnecessary.



src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 31 (patched)
<https://reviews.apache.org/r/63536/#comment268094>

    How about s/futureMap/delayedTransitions/



src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 40 (patched)
<https://reviews.apache.org/r/63536/#comment268096>

    Please use `AsyncUtil.singleThreadLoggingScheduledExecutor()` instead.  In addition to automatically logging and tracking exceptions, it will create a daemon thread.



src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 48 (patched)
<https://reviews.apache.org/r/63536/#comment268097>

    Consider using `Tasks.getLatestEvent(task).getTimestamp()`
    
    to hide the need for `Iterables.getLast()`



src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 53-54 (patched)
<https://reviews.apache.org/r/63536/#comment268100>

    `String taskId = Tasks.id(stateChange.getTask());`



src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 69 (patched)
<https://reviews.apache.org/r/63536/#comment268103>

    Readability nit:
    ```java
    if (!stateChange.isTransition()) {
      // This is a storage recovery event, so the calculated time
      // may be in the past.
      delay = Math.max(...);
    }
    ```



src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 76 (patched)
<https://reviews.apache.org/r/63536/#comment268109>

    Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping you perform a CAS.  If we had a _true_ CAS operation, you could very safely avoid the need to cancel futures.  You can't do that with the status only since the task can now cycle with `PARTITIONED`.
    
    What do you think about using the last transition timestamp to do this?
    
    ```java
    long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp();
    executor.schedule(() -> {
      storage.write(... {
        if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) {
          // proceed
        } else {
          // cas failed, the task has since transitioned
        }
      }
    
    }, delay, TimeUnit.SECONDS);
    ```



src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
Lines 294 (patched)
<https://reviews.apache.org/r/63536/#comment268110>

    `==` instead of `.equals()`



src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
Lines 297 (patched)
<https://reviews.apache.org/r/63536/#comment268111>

    This is unnecessary.



src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
Lines 390 (patched)
<https://reviews.apache.org/r/63536/#comment268113>

    s/The goal here is to c/C/



src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
Lines 400 (patched)
<https://reviews.apache.org/r/63536/#comment268114>

    It's worth mentioning that this is done to bound the number of events stored for a task.



src/main/java/org/apache/aurora/scheduler/state/StateModule.java
Lines 70 (patched)
<https://reviews.apache.org/r/63536/#comment268115>

    PubsubEventModule.bindSubscriber(binder, PartitionManager.class);


- Bill Farner


On Nov. 8, 2017, 4:48 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 8, 2017, 4:48 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/
-----------------------------------------------------------

(Updated Nov. 9, 2017, 12:48 a.m.)


Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.


Changes
-------

Added logic to handle failovers / dealing with timer race conditions.


Repository: aurora


Description
-------

This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#

I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.


Diffs (updated)
-----

  api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
  examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
  src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
  src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
  src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
  src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
  src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
  src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
  src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
  src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
  src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 


Diff: https://reviews.apache.org/r/63536/diff/2/

Changes: https://reviews.apache.org/r/63536/diff/1-2/


Testing
-------

Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:

1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)

I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.


File Attachments
----------------

Task in PARTITIONED state
  https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
Task back into running when partition resolved
  https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
Compaction of PARTITIONED cycles (note timestamps)
  https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png


Thanks,

David McLaughlin


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190359
-----------------------------------------------------------



Master (ad86177) is red with this patch.
  ./build-support/jenkins/build.sh

                                                               ^
  required: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,boolean,Optional<String>
  found: String,String,Optional<Object>,Amount<Long,Time>,boolean,boolean,Optional<Object>
  reason: actual and formal argument lists differ in length
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java:109: error: method buildFrameworkInfo in class CommandLineDriverSettingsModule cannot be applied to given types;
    Protos.FrameworkInfo info = CommandLineDriverSettingsModule.buildFrameworkInfo(
                                                               ^
  required: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,boolean,Optional<String>
  found: String,String,Optional<Object>,Amount<Long,Time>,boolean,boolean,Optional<Object>
  reason: actual and formal argument lists differ in length
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java:125: error: method buildFrameworkInfo in class CommandLineDriverSettingsModule cannot be applied to given types;
    Protos.FrameworkInfo info = CommandLineDriverSettingsModule.buildFrameworkInfo(
                                                               ^
  required: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,boolean,Optional<String>
  found: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,Optional<Object>
  reason: actual and formal argument lists differ in length
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java:140: error: method buildFrameworkInfo in class CommandLineDriverSettingsModule cannot be applied to given types;
    Protos.FrameworkInfo info = CommandLineDriverSettingsModule.buildFrameworkInfo(
                                                               ^
  required: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,boolean,Optional<String>
  found: String,String,Optional<String>,Amount<Long,Time>,boolean,boolean,Optional<String>
  reason: actual and formal argument lists differ in length
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
5 errors
 FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':compileTestJava'.
> Compilation failed; see the compiler error output for details.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 3m 44s
27 actionable tasks: 21 executed, 6 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Nov. 7, 2017, 12:28 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2017, 12:28 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 7, 2017, 11:36 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 38-42 (patched)
> > <https://reviews.apache.org/r/63536/diff/1/?file=1884100#file1884100line38>
> >
> >     On the other hand, this would be consistent from the user perspective. A user only sees the compacted task events with a single RUNNING to PARTITIONED transition. 
> >     
> >     If we cancel the future, he might wonder why the task does not end up as LOST.

Maybe it's not clear without tests - but this timer is responsible for moving from PARTITIONED -> LOST after a specified delay. This comment is concerned about the user saying "give up on a partition after 15 minutes" and then a task moving from RUNNING -> PARTITIONED -> RUNNING -> PARTITIONED within a 15 minute window. When the task transitions back to RUNNING the first time, the user would expect the timer to be reset.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190383
-----------------------------------------------------------


On Nov. 7, 2017, 7:28 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2017, 7:28 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 7, 2017, 11:36 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 38-42 (patched)
> > <https://reviews.apache.org/r/63536/diff/1/?file=1884100#file1884100line38>
> >
> >     On the other hand, this would be consistent from the user perspective. A user only sees the compacted task events with a single RUNNING to PARTITIONED transition. 
> >     
> >     If we cancel the future, he might wonder why the task does not end up as LOST.
> 
> David McLaughlin wrote:
>     Maybe it's not clear without tests - but this timer is responsible for moving from PARTITIONED -> LOST after a specified delay. This comment is concerned about the user saying "give up on a partition after 15 minutes" and then a task moving from RUNNING -> PARTITIONED -> RUNNING -> PARTITIONED within a 15 minute window. When the task transitions back to RUNNING the first time, the user would expect the timer to be reset.
> 
> David McLaughlin wrote:
>     It also just occured to me this timer won't survive a restart. So if I want to use this event-driven model, then I'll need to plug into the storage restore operation too.

Jordan pointed out that it will survive a restart as the storage layer sends all the events on recovery (which I should have remembered from WebHooks!). So I just need to check each state change for isTransition and then change delay to depend on the latest task event time.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190383
-----------------------------------------------------------


On Nov. 7, 2017, 7:28 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2017, 7:28 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by David McLaughlin <da...@dmclaughlin.com>.

> On Nov. 7, 2017, 11:36 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 38-42 (patched)
> > <https://reviews.apache.org/r/63536/diff/1/?file=1884100#file1884100line38>
> >
> >     On the other hand, this would be consistent from the user perspective. A user only sees the compacted task events with a single RUNNING to PARTITIONED transition. 
> >     
> >     If we cancel the future, he might wonder why the task does not end up as LOST.
> 
> David McLaughlin wrote:
>     Maybe it's not clear without tests - but this timer is responsible for moving from PARTITIONED -> LOST after a specified delay. This comment is concerned about the user saying "give up on a partition after 15 minutes" and then a task moving from RUNNING -> PARTITIONED -> RUNNING -> PARTITIONED within a 15 minute window. When the task transitions back to RUNNING the first time, the user would expect the timer to be reset.

It also just occured to me this timer won't survive a restart. So if I want to use this event-driven model, then I'll need to plug into the storage restore operation too.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190383
-----------------------------------------------------------


On Nov. 7, 2017, 7:28 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2017, 7:28 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190383
-----------------------------------------------------------




src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
Lines 38-42 (patched)
<https://reviews.apache.org/r/63536/#comment267715>

    On the other hand, this would be consistent from the user perspective. A user only sees the compacted task events with a single RUNNING to PARTITIONED transition. 
    
    If we cancel the future, he might wonder why the task does not end up as LOST.


- Stephan Erb


On Nov. 7, 2017, 8:28 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2017, 8:28 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 
>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 
>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 
>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a 
>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 
>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 
>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 
>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc 
>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e 
>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c 
> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>