Posted to reviews@aurora.apache.org by Mehrdad Nurolahzade <me...@nurolahzade.com> on 2017/01/04 17:16:53 UTC

Review Request 55179: AURORA-1820 Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55179/
-----------------------------------------------------------

Review request for Aurora, Joshua Cohen and Stephan Erb.


Bugs: AURORA-1820
    https://issues.apache.org/jira/browse/AURORA-1820


Repository: aurora


Description
-------

`TimedOutTaskHandler` acquires the storage write lock for every task, every time the task transitions to a transient state. After a default timeout of 5 minutes, it verifies whether the task has transitioned out of the transient state.

The verification step takes place while holding the storage write lock. In over 99% of cases, the logic short-circuits and returns from `StateManagerImpl.updateTaskAndExternalState()` once it learns that the task has already transitioned out of the transient state.

This patch reduces storage write lock contention by adopting the Double-Checked Locking pattern in `TimedOutTaskHandler.run()`.
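
For readers unfamiliar with the pattern, the idea can be sketched roughly as follows. This is not the code in the diff: `SketchStorage`, `TimedOutTaskHandlerSketch`, the `Status` values, and the `writeLockAcquisitions` counter are hypothetical names invented for illustration, and a plain `ReentrantReadWriteLock` stands in for the scheduler's storage locking.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the scheduler storage; not the Aurora Storage API. */
class SketchStorage {
  /** Illustrative statuses only; ASSIGNED plays the role of a transient state. */
  enum Status { ASSIGNED, RUNNING, LOST }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Status> tasks = new ConcurrentHashMap<>();
  /** Counts write lock acquisitions so the effect of the first check is visible. */
  final AtomicInteger writeLockAcquisitions = new AtomicInteger();

  void put(String taskId, Status status) {
    tasks.put(taskId, status);
  }

  /** Cheap check under the shared read lock. */
  Status read(String taskId) {
    lock.readLock().lock();
    try {
      return tasks.get(taskId);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Re-checks the status under the exclusive write lock before changing it. */
  boolean transitionIf(String taskId, Status expected, Status next) {
    lock.writeLock().lock();
    try {
      writeLockAcquisitions.incrementAndGet();
      return tasks.replace(taskId, expected, next);
    } finally {
      lock.writeLock().unlock();
    }
  }
}

/** Hypothetical handler mirroring the shape of TimedOutTaskHandler.run(). */
class TimedOutTaskHandlerSketch implements Runnable {
  private final SketchStorage storage;
  private final String taskId;
  private final SketchStorage.Status transientStatus;

  TimedOutTaskHandlerSketch(
      SketchStorage storage, String taskId, SketchStorage.Status transientStatus) {
    this.storage = storage;
    this.taskId = taskId;
    this.transientStatus = transientStatus;
  }

  @Override
  public void run() {
    // First check, without the write lock: most tasks have already moved on.
    if (storage.read(taskId) != transientStatus) {
      return;
    }
    // Second check, under the write lock: the task may have changed state
    // between the first check and acquiring the lock.
    storage.transitionIf(taskId, transientStatus, SketchStorage.Status.LOST);
  }
}
```

In this sketch, a task that has already left the transient state never touches the write lock, which is the common case the description refers to. The second check is still needed because the state can change between the unlocked read and the lock acquisition.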


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/reconciliation/TaskTimeout.java 2dc9bc2c6916595270187f0f29d5bd8c5ba7e9ad 
  src/test/java/org/apache/aurora/scheduler/reconciliation/TaskTimeoutTest.java 1006ddb6caea015c2d4e014bd044f2933541c84f 

Diff: https://reviews.apache.org/r/55179/diff/


Testing
-------

```
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

...

*** OK (All tests passed) ***

mesos-master start/running, process 22759
+ RETCODE=0
+ restore_netrc
+ mv /home/vagrant/.netrc.bak /home/vagrant/.netrc
+ true
Connection to 127.0.0.1 closed.

real	25m36.144s
user	0m1.358s
sys	0m0.595s
```


Thanks,

Mehrdad Nurolahzade


Re: Review Request 55179: AURORA-1820 Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler

Posted by Joshua Cohen <jc...@apache.org>.

> On Jan. 4, 2017, 5:39 p.m., Stephan Erb wrote:
> > src/test/java/org/apache/aurora/scheduler/reconciliation/TaskTimeoutTest.java, line 130
> > <https://reviews.apache.org/r/55179/diff/1/?file=1596711#file1596711line130>
> >
> >     Given the motivation of the patch, we should probably check that we did not try to acquire the storage lock.

+1


- Joshua


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55179/#review160510
-----------------------------------------------------------




Re: Review Request 55179: AURORA-1820 Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler

Posted by Stephan Erb <se...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55179/#review160510
-----------------------------------------------------------


Fix it, then Ship it!




LGTM.


src/test/java/org/apache/aurora/scheduler/reconciliation/TaskTimeoutTest.java (line 130)
<https://reviews.apache.org/r/55179/#comment231646>

    Given the motivation of the patch, we should probably check that we did not try to acquire the storage lock.
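
    One way such a check could look, sketched with a hand-rolled recording fake rather than the mocks used in `TaskTimeoutTest`. `StorageSeam`, `RecordingStorage`, and `TimeoutCheckSketch` are hypothetical names for illustration and do not exist in Aurora.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical storage seam; the real test would mock Aurora's storage instead. */
interface StorageSeam {
  /** Read path: does not need the write lock. */
  boolean isTransient(String taskId);

  /** Write path: stands for work done while holding the write lock. */
  void write(Runnable work);
}

/** Hand-rolled fake that records which storage paths were exercised. */
class RecordingStorage implements StorageSeam {
  final List<String> calls = new ArrayList<>();
  private final boolean taskIsTransient;

  RecordingStorage(boolean taskIsTransient) {
    this.taskIsTransient = taskIsTransient;
  }

  @Override
  public boolean isTransient(String taskId) {
    calls.add("read");
    return taskIsTransient;
  }

  @Override
  public void write(Runnable work) {
    calls.add("write");
    work.run();
  }
}

/** Hypothetical code under test, shaped like the patched handler. */
class TimeoutCheckSketch {
  static void run(StorageSeam storage, String taskId) {
    if (!storage.isTransient(taskId)) {
      return; // Short-circuit before touching the write path.
    }
    storage.write(() -> {
      // Re-check inside the write operation before acting.
      if (storage.isTransient(taskId)) {
        // The real handler would transition the task here.
      }
    });
  }
}
```

    A test for the short-circuit case would then assert that the recorded calls contain a read but no write, while the timed-out case records a read, a write, and the re-check read.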


- Stephan Erb




Re: Review Request 55179: AURORA-1820 Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler

Posted by Joshua Cohen <jc...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55179/#review160515
-----------------------------------------------------------


Ship it!




LGTM pending additional testing per Stephan's comments.

- Joshua Cohen




Re: Review Request 55179: AURORA-1820 Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55179/#review160540
-----------------------------------------------------------


Ship it!




Master (21ad18e) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot




Re: Review Request 55179: AURORA-1820 Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler

Posted by Mehrdad Nurolahzade <me...@nurolahzade.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55179/
-----------------------------------------------------------

(Updated Jan. 4, 2017, 1:23 p.m.)


Review request for Aurora, Joshua Cohen and Stephan Erb.


Changes
-------

Added expectations for storage calls


Bugs: AURORA-1820
    https://issues.apache.org/jira/browse/AURORA-1820


Repository: aurora


Description
-------

`TimedOutTaskHandler` acquires the storage write lock for every task, every time the task transitions to a transient state. After a default timeout of 5 minutes, it verifies whether the task has transitioned out of the transient state.

The verification step takes place while holding the storage write lock. In over 99% of cases, the logic short-circuits and returns from `StateManagerImpl.updateTaskAndExternalState()` once it learns that the task has already transitioned out of the transient state.

This patch reduces storage write lock contention by adopting the Double-Checked Locking pattern in `TimedOutTaskHandler.run()`.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/reconciliation/TaskTimeout.java 2dc9bc2c6916595270187f0f29d5bd8c5ba7e9ad 
  src/test/java/org/apache/aurora/scheduler/reconciliation/TaskTimeoutTest.java 1006ddb6caea015c2d4e014bd044f2933541c84f 

Diff: https://reviews.apache.org/r/55179/diff/


Testing
-------

```
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

...

*** OK (All tests passed) ***

mesos-master start/running, process 22759
+ RETCODE=0
+ restore_netrc
+ mv /home/vagrant/.netrc.bak /home/vagrant/.netrc
+ true
Connection to 127.0.0.1 closed.

real	25m36.144s
user	0m1.358s
sys	0m0.595s
```


Thanks,

Mehrdad Nurolahzade


Re: Review Request 55179: AURORA-1820 Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55179/#review160512
-----------------------------------------------------------


Ship it!




Master (21ad18e) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot

