You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@aurora.apache.org by Jordan Ly <jo...@gmail.com> on 2018/02/16 02:54:48 UTC

Review Request 65680: Fix cron id collision bug by avoiding state in Quartz jobs

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65680/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh Kumar Shanmugham.


Repository: aurora


Description
-------

There is a pretty rare situation that can occur that will cause the scheduler to crash.

Stacktrace:
```
Feb 16, 2018 12:22:36 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state.
java.lang.IllegalArgumentException: Instance ID collision detected.
        at org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126)
        at org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213)
        at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
        at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
        at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
        at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
        at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
        at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
        at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
        at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
        at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
        at java.lang.Thread.run(Thread.java:748)
```

The steps are:
```
1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 minute)
2. Perform 2 runs of the cron
3. Deschedule the cron
4. Reschedule the cron
5. Perform 3 runs of the cron
6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and a new cron trying to start
```
The reason for this bug is that some state is persisted between cron scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold a "work in progress" token, while the `killFollowups` set indicates "completion" in order to ensure there are no concurrent runs. Descheduling a cron will remove the "work in progress" token while ignoring the "completion" token in `killFollowups`. Later, a "work in progress" token may be added and a "completion" token may be seen mistakenly from a previous schedule, causing a concurrent run.

For the example above, the runs in step 2 will add the key to the set to show that all runs are finished and another run can start. The 3rd run in step 5 will mistakenly see that the 2nd run has started and finished since the "completion" token was preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run causing a ID collision.

We should not preserve any state between cron scheduling/descheduling outside of the given Quartz `JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as `killFollowups`.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 3604dd4c84043def3fe098ca0f87bf8bab99b3db 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 5083fcd407a069617634b82db1ad799b5d4d4551 


Diff: https://reviews.apache.org/r/65680/diff/1/


Testing
-------

Repro'd bug using steps above. Validated code changes solved issue.

Added small change in unit test to ensure state is now preserved in the Quartz job context.

`./gradlew test`


Thanks,

Jordan Ly


Re: Review Request 65680: Fix cron id collision bug by avoiding state in Quartz jobs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65680/#review197657
-----------------------------------------------------------



Master (59f66ff) is red with this patch.
  ./build-support/jenkins/build.sh

 [189] ./src/main/resources/source-sans-pro.css 1.05 kB {0} [built]
 [221] ./src/main/js/index.js 3.43 kB {0} [built]
    + 266 hidden modules
:processResources
:classes
:jar
:moreStartScripts
:startScripts
:distTar
:distZip
:assemble
:compileTestJavaNote: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/storage/durability/DurableStorageTest.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

:processTestResources
:testClasses
:compileJmhJavaNote: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/jmh/java/org/apache/aurora/benchmark/fakes/FakeSchedulerDriver.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

:processJmhResources NO-SOURCE
:jmhClasses
:checkstyleJmh
:checkstyleMain[ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java:30:8: Unused import - com.google.common.collect.Sets. [UnusedImports]
 FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':checkstyleMain'.
> Checkstyle rule violations were found. See the report at: file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/checkstyle/main.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 4m 45s
32 actionable tasks: 26 executed, 6 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Feb. 15, 2018, 6:54 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65680/
> -----------------------------------------------------------
> 
> (Updated Feb. 15, 2018, 6:54 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> There is a pretty rare situation that can occur that will cause the scheduler to crash.
> 
> Stacktrace:
> ```
> Feb 16, 2018 12:22:36 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state.
> java.lang.IllegalArgumentException: Instance ID collision detected.
>         at org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126)
>         at org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213)
>         at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
>         at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
>         at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
>         at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
>         at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
>         at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>         at java.lang.Thread.run(Thread.java:748)
> ```
> 
> The steps are:
> ```
> 1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 minute)
> 2. Perform 2 runs of the cron
> 3. Deschedule the cron
> 4. Reschedule the cron
> 5. Perform 3 runs of the cron
> 6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and a new cron trying to start
> ```
> The reason for this bug is that some state is persisted between cron scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold a "work in progress" token, while the `killFollowups` set indicates "completion" in order to ensure there are no concurrent runs. Descheduling a cron will remove the "work in progress" token while ignoring the "completion" token in `killFollowups`. Later, a "work in progress" token may be added and a "completion" token may be seen mistakenly from a previous schedule, causing a concurrent run.
> 
> For the example above, the runs in step 2 will add the key to the set to show that all runs are finished and another run can start. The 3rd run in step 5 will mistakenly see that the 2nd run has started and finished since the "completion" token was preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run causing a ID collision.
> 
> We should not preserve any state between cron scheduling/descheduling outside of the given Quartz `JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as `killFollowups`.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 3604dd4c84043def3fe098ca0f87bf8bab99b3db 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 5083fcd407a069617634b82db1ad799b5d4d4551 
> 
> 
> Diff: https://reviews.apache.org/r/65680/diff/1/
> 
> 
> Testing
> -------
> 
> Repro'd bug using steps above. Validated code changes solved issue.
> 
> Added small change in unit test to ensure state is now preserved in the Quartz job context.
> 
> `./gradlew test`
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 65680: Fix cron id collision bug by avoiding state in Quartz jobs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65680/#review197706
-----------------------------------------------------------


Ship it!




Master (59f66ff) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Feb. 17, 2018, 1:22 a.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65680/
> -----------------------------------------------------------
> 
> (Updated Feb. 17, 2018, 1:22 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> There is a pretty rare situation that can occur that will cause the scheduler to crash.
> 
> Stacktrace:
> ```
> Feb 16, 2018 12:22:36 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state.
> java.lang.IllegalArgumentException: Instance ID collision detected.
>         at org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126)
>         at org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213)
>         at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
>         at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
>         at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
>         at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
>         at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
>         at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>         at java.lang.Thread.run(Thread.java:748)
> ```
> 
> The steps are:
> ```
> 1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 minute)
> 2. Perform 2 runs of the cron
> 3. Deschedule the cron
> 4. Reschedule the cron
> 5. Perform 3 runs of the cron
> 6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and a new cron trying to start
> ```
> The reason for this bug is that some state is persisted between cron scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold a "work in progress" token, while the `killFollowups` set indicates "completion" in order to ensure there are no concurrent runs. Descheduling a cron will remove the "work in progress" token while ignoring the "completion" token in `killFollowups`. Later, a "work in progress" token may be added and a "completion" token may be seen mistakenly from a previous schedule, causing a concurrent run.
> 
> For the example above, the runs in step 2 will add the key to the set to show that all runs are finished and another run can start. The 3rd run in step 5 will mistakenly see that the 2nd run has started and finished since the "completion" token was preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run causing a ID collision.
> 
> We should not preserve any state between cron scheduling/descheduling outside of the given Quartz `JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as `killFollowups`.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 3604dd4c84043def3fe098ca0f87bf8bab99b3db 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 5083fcd407a069617634b82db1ad799b5d4d4551 
> 
> 
> Diff: https://reviews.apache.org/r/65680/diff/3/
> 
> 
> Testing
> -------
> 
> Repro'd bug using steps above. Validated code changes solved issue.
> 
> Added small change in unit test to ensure state is now preserved in the Quartz job context.
> 
> `./gradlew test`
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 65680: Fix cron id collision bug by avoiding state in Quartz jobs

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65680/
-----------------------------------------------------------

(Updated Feb. 17, 2018, 1:22 a.m.)


Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh Kumar Shanmugham.


Changes
-------

Fix PMD rule violation


Repository: aurora


Description
-------

There is a pretty rare situation that can occur that will cause the scheduler to crash.

Stacktrace:
```
Feb 16, 2018 12:22:36 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state.
java.lang.IllegalArgumentException: Instance ID collision detected.
        at org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126)
        at org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213)
        at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
        at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
        at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
        at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
        at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
        at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
        at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
        at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
        at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
        at java.lang.Thread.run(Thread.java:748)
```

The steps are:
```
1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 minute)
2. Perform 2 runs of the cron
3. Deschedule the cron
4. Reschedule the cron
5. Perform 3 runs of the cron
6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and a new cron trying to start
```
The reason for this bug is that some state is persisted between cron scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold a "work in progress" token, while the `killFollowups` set indicates "completion" in order to ensure there are no concurrent runs. Descheduling a cron will remove the "work in progress" token while ignoring the "completion" token in `killFollowups`. Later, a "work in progress" token may be added and a "completion" token may be seen mistakenly from a previous schedule, causing a concurrent run.

For the example above, the runs in step 2 will add the key to the set to show that all runs are finished and another run can start. The 3rd run in step 5 will mistakenly see that the 2nd run has started and finished since the "completion" token was preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run causing a ID collision.

We should not preserve any state between cron scheduling/descheduling outside of the given Quartz `JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as `killFollowups`.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 3604dd4c84043def3fe098ca0f87bf8bab99b3db 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 5083fcd407a069617634b82db1ad799b5d4d4551 


Diff: https://reviews.apache.org/r/65680/diff/3/

Changes: https://reviews.apache.org/r/65680/diff/2-3/


Testing
-------

Repro'd bug using steps above. Validated code changes solved issue.

Added small change in unit test to ensure state is now preserved in the Quartz job context.

`./gradlew test`


Thanks,

Jordan Ly


Re: Review Request 65680: Fix cron id collision bug by avoiding state in Quartz jobs

Posted by Aurora ReviewBot <wf...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65680/#review197687
-----------------------------------------------------------



Master (59f66ff) is red with this patch.
  ./build-support/jenkins/build.sh

:distTar
:distZip
:assemble
:compileTestJavaNote: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/test/java/org/apache/aurora/scheduler/storage/durability/DurableStorageTest.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

:processTestResources
:testClasses
:compileJmhJavaNote: /home/jenkins/jenkins-slave/workspace/AuroraBot/src/jmh/java/org/apache/aurora/benchmark/fakes/FakeSchedulerDriver.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

:processJmhResources NO-SOURCE
:jmhClasses
:checkstyleJmh
:checkstyleMain
:checkstyleTest
:licenseJmh UP-TO-DATE
:licenseMain UP-TO-DATE
:licenseTest UP-TO-DATE
:license UP-TO-DATE
:pmdJmh
:pmdMain
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java:149:	Avoid if (x != y) ..; else ..;
:pmdMain FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':pmdMain'.
> 1 PMD rule violations were found. See the report at: file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/pmd/main.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 5m 5s
38 actionable tasks: 29 executed, 9 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Feb. 16, 2018, 7:45 p.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65680/
> -----------------------------------------------------------
> 
> (Updated Feb. 16, 2018, 7:45 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> There is a pretty rare situation that can occur that will cause the scheduler to crash.
> 
> Stacktrace:
> ```
> Feb 16, 2018 12:22:36 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state.
> java.lang.IllegalArgumentException: Instance ID collision detected.
>         at org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126)
>         at org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213)
>         at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
>         at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
>         at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
>         at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
>         at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
>         at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>         at java.lang.Thread.run(Thread.java:748)
> ```
> 
> The steps are:
> ```
> 1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 minute)
> 2. Perform 2 runs of the cron
> 3. Deschedule the cron
> 4. Reschedule the cron
> 5. Perform 3 runs of the cron
> 6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and a new cron trying to start
> ```
> The reason for this bug is that some state is persisted between cron scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold a "work in progress" token, while the `killFollowups` set indicates "completion" in order to ensure there are no concurrent runs. Descheduling a cron will remove the "work in progress" token while ignoring the "completion" token in `killFollowups`. Later, a "work in progress" token may be added and a "completion" token may be seen mistakenly from a previous schedule, causing a concurrent run.
> 
> For the example above, the runs in step 2 will add the key to the set to show that all runs are finished and another run can start. The 3rd run in step 5 will mistakenly see that the 2nd run has started and finished since the "completion" token was preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run causing a ID collision.
> 
> We should not preserve any state between cron scheduling/descheduling outside of the given Quartz `JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as `killFollowups`.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 3604dd4c84043def3fe098ca0f87bf8bab99b3db 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 5083fcd407a069617634b82db1ad799b5d4d4551 
> 
> 
> Diff: https://reviews.apache.org/r/65680/diff/2/
> 
> 
> Testing
> -------
> 
> Repro'd bug using steps above. Validated code changes solved issue.
> 
> Added small change in unit test to ensure state is now preserved in the Quartz job context.
> 
> `./gradlew test`
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 65680: Fix cron id collision bug by avoiding state in Quartz jobs

Posted by Santhosh Kumar Shanmugham <sa...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65680/#review197700
-----------------------------------------------------------


Ship it!




Ship It!

- Santhosh Kumar Shanmugham


On Feb. 16, 2018, 10:45 a.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65680/
> -----------------------------------------------------------
> 
> (Updated Feb. 16, 2018, 10:45 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> There is a pretty rare situation that can occur that will cause the scheduler to crash.
> 
> Stacktrace:
> ```
> Feb 16, 2018 12:22:36 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state.
> java.lang.IllegalArgumentException: Instance ID collision detected.
>         at org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126)
>         at org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213)
>         at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
>         at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
>         at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
>         at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
>         at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
>         at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
>         at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>         at java.lang.Thread.run(Thread.java:748)
> ```
> 
> The steps are:
> ```
> 1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 minute)
> 2. Perform 2 runs of the cron
> 3. Deschedule the cron
> 4. Reschedule the cron
> 5. Perform 3 runs of the cron
> 6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and a new cron trying to start
> ```
> The reason for this bug is that some state is persisted between cron scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold a "work in progress" token, while the `killFollowups` set indicates "completion" in order to ensure there are no concurrent runs. Descheduling a cron will remove the "work in progress" token while ignoring the "completion" token in `killFollowups`. Later, a "work in progress" token may be added and a "completion" token may be seen mistakenly from a previous schedule, causing a concurrent run.
> 
> For the example above, the runs in step 2 will add the key to the set to show that all runs are finished and another run can start. The 3rd run in step 5 will mistakenly see that the 2nd run has started and finished since the "completion" token was preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run causing a ID collision.
> 
> We should not preserve any state between cron scheduling/descheduling outside of the given Quartz `JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as `killFollowups`.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 3604dd4c84043def3fe098ca0f87bf8bab99b3db 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 5083fcd407a069617634b82db1ad799b5d4d4551 
> 
> 
> Diff: https://reviews.apache.org/r/65680/diff/2/
> 
> 
> Testing
> -------
> 
> Repro'd bug using steps above. Validated code changes solved issue.
> 
> Added small change in unit test to ensure state is now preserved in the Quartz job context.
> 
> `./gradlew test`
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>


Re: Review Request 65680: Fix cron id collision bug by avoiding state in Quartz jobs

Posted by Jordan Ly <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65680/
-----------------------------------------------------------

(Updated Feb. 16, 2018, 6:45 p.m.)


Review request for Aurora, David McLaughlin, Renan DelValle, and Santhosh Kumar Shanmugham.


Changes
-------

Remove unused imports


Repository: aurora


Description
-------

There is a pretty rare situation that can occur that will cause the scheduler to crash.

Stacktrace:
```
Feb 16, 2018 12:22:36 AM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
SEVERE: Service CronBatchWorker [FAILED] has failed in the RUNNING state.
java.lang.IllegalArgumentException: Instance ID collision detected.
        at org.apache.aurora.scheduler.state.StateManagerImpl.insertPendingTasks(StateManagerImpl.java:126)
        at org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$null$0(AuroraCronJob.java:213)
        at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
        at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
        at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
        at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:89)
        at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.doInTransaction(DurableStorage.java:198)
        at org.apache.aurora.scheduler.storage.durability.DurableStorage.write(DurableStorage.java:221)
        at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.write(CallOrderEnforcingStorage.java:132)
        at org.apache.aurora.scheduler.BatchWorker.processBatch(BatchWorker.java:207)
        at org.apache.aurora.scheduler.BatchWorker.run(BatchWorker.java:199)
        at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
        at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
        at java.lang.Thread.run(Thread.java:748)
```

The steps are:
```
1. Schedule and start a cron (runs every minute, graceful shutdown period > 1 minute)
2. Perform 2 runs of the cron
3. Deschedule the cron
4. Reschedule the cron
5. Perform 3 runs of the cron
6. Scheduler will crash on the 3rd run due to an ID collision between the already running cron and a new cron trying to start
```
The reason for this bug is that some state is persisted between cron scheduling/descheduling via `killFollowups`. We use Quartz `JobDataMap` to hold a "work in progress" token, while the `killFollowups` set indicates "completion" in order to ensure there are no concurrent runs. Descheduling a cron will remove the "work in progress" token while ignoring the "completion" token in `killFollowups`. Later, a "work in progress" token may be added and a "completion" token may be seen mistakenly from a previous schedule, causing a concurrent run.

For the example above, the runs in step 2 will add the key to the set to show that all runs are finished and another run can start. The 3rd run in step 5 will mistakenly see that the 2nd run has started and finished since the "completion" token was preserved from the first set of runs in step 2. This will erroneously trigger a concurrent run causing a ID collision.

We should not preserve any state between cron scheduling/descheduling outside of the given Quartz `JobDataMap` abstraction. We can use the presence of a value here to achieve the same thing as `killFollowups`.


Diffs (updated)
-----

  src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 3604dd4c84043def3fe098ca0f87bf8bab99b3db 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 5083fcd407a069617634b82db1ad799b5d4d4551 


Diff: https://reviews.apache.org/r/65680/diff/2/

Changes: https://reviews.apache.org/r/65680/diff/1-2/


Testing
-------

Repro'd bug using steps above. Validated code changes solved issue.

Added small change in unit test to ensure state is now preserved in the Quartz job context.

`./gradlew test`


Thanks,

Jordan Ly