Posted to issues@aurora.apache.org by "Maxim Khutornenko (JIRA)" <ji...@apache.org> on 2016/02/01 18:27:39 UTC

[jira] [Commented] (AURORA-1603) Investigate RB:42922 reversal

    [ https://issues.apache.org/jira/browse/AURORA-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126600#comment-15126600 ] 

Maxim Khutornenko commented on AURORA-1603:
-------------------------------------------

Sorry, don't have time to put together a script but here is a sequence of steps to reproduce:
* Check out any commit _before_ 89fad5a8895482b6c3fa45356137aa250d766dfe and create a few job updates. The key here is to have a few updates with identical TaskConfigs stored as {{initialState}}. The easiest way is probably to have a job with at least 2-3 instances in it, targeted by at least 2 job updates with {{batch_size=1}}:
** Start a job update and immediately abort it (this may update the first instance but should leave the others untouched).
** Start a second job update.
** At this point there will be 2 job updates sharing the {{TaskConfig}} of instances with IDs >=1 stored in {{JobUpdateInstructions.initialState}}.
* Upgrade cluster to 89fad5a8895482b6c3fa45356137aa250d766dfe build
* *Important*: trigger snapshot creation via {{aurora_admin scheduler_snapshot devcluster}}
* Roll back to any earlier version and rebuild the scheduler
* Upon restart, the scheduler will keep failing with the error from the issue description.

A bit more detail on why this happens. Once a build is upgraded to 89fad5a8895482b6c3fa45356137aa250d766dfe, there are no more {{jobName}} and {{environment}} fields in the {{TaskConfig}}. This is fine as long as we don't roll back. If we do, though, the earlier version's thrift schema will populate {{TaskConfig}} objects read from the snapshot with NULL {{jobName}} and {{environment}} fields. When the TaskConfigs with NULL fields are then passed into the {{getConfigRow}} function (see the stack trace in the issue description) to find a match against DB-stored configs, the match will never be found. This is because the [resultMap|https://github.com/apache/aurora/commit/89fad5a8895482b6c3fa45356137aa250d766dfe#diff-e140306b7d9b86b2e4657e014d74fe28L133] still populates the removed fields for DB-read objects, while snapshot-read objects now lack them.
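
To make the mismatch concrete, here is a rough Python model of the lookup, not the actual scheduler code. Everything in it ({{ConfigTable}}, {{unique_index}}, the reduced {{TaskConfig}} with a single {{command}} field) is a hypothetical stand-in. It also assumes that a failed lookup leads to a new row being inserted, which is my reading of how the duplicate keys would end up in {{Maps.uniqueIndex}}.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskConfig:
    # Hypothetical, heavily reduced stand-in for the thrift TaskConfig.
    job_name: Optional[str]
    environment: Optional[str]
    command: str


def unique_index(rows):
    """Mimic Guava's Maps.uniqueIndex: reject two values that map to the same key."""
    index = {}
    for row_id, config in rows:
        if config in index:
            raise ValueError(f"Multiple entries with same key: {config}")
        index[config] = row_id
    return index


class ConfigTable:
    """Toy model of the DB-backed config table for a single job."""

    def __init__(self, job_name, environment):
        self.job_name = job_name
        self.environment = environment
        self.rows = []  # (row_id, command) pairs

    def _read_rows(self):
        # Models the resultMap: DB-read configs always get the job key
        # fields populated, whatever the inserted object carried.
        return [
            (row_id, TaskConfig(self.job_name, self.environment, command))
            for row_id, command in self.rows
        ]

    def get_config_row(self, config):
        return unique_index(self._read_rows()).get(config)

    def insert(self, config):
        row_id = self.get_config_row(config)
        if row_id is None:
            # Assumption: no match means a new row gets written.
            row_id = len(self.rows) + 1
            self.rows.append((row_id, config.command))
        return row_id


if __name__ == "__main__":
    table = ConfigTable("hello", "devel")
    # What the rolled-back schema reads from the snapshot: NULL fields.
    from_snapshot = TaskConfig(None, None, "echo hi")
    table.insert(from_snapshot)  # no rows yet, so a row is written
    table.insert(from_snapshot)  # never matches the DB-read config, second identical row
    try:
        table.insert(from_snapshot)  # the index now sees two equal DB-read configs
    except ValueError as err:
        print("restore failed:", err)
```

In this model, a config that carries the same {{job_name}} and {{environment}} as the DB-read rows matches on the second insert and no duplicate row appears, which would correspond to the case where no rollback happens.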

> Investigate RB:42922 reversal
> -----------------------------
>
>                 Key: AURORA-1603
>                 URL: https://issues.apache.org/jira/browse/AURORA-1603
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
>            Priority: Critical
>
> We had to roll back the scheduler due to duplicate instances in the UI, and when we tried to restart on the older version (8d3fb2413306387bc533b1b800bbc97149f96b26) we got the following error, which prevented the scheduler from loading the snapshot:
> {noformat}
> To index multiple values under a key, use Multimaps.index.
>         at com.google.common.collect.Maps.uniqueIndex(Maps.java:1215) ~[guava-19.0.jar:na]
>         at com.google.common.collect.Maps.uniqueIndex(Maps.java:1173) ~[guava-19.0.jar:na]
>         at org.apache.aurora.scheduler.storage.db.TaskConfigManager.getConfigRow(TaskConfigManager.java:46) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.db.TaskConfigManager.insert(TaskConfigManager.java:57) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.db.DbJobUpdateStore.saveJobUpdate(DbJobUpdateStore.java:125) ~[aurora-113.jar:na]
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
>         at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl$7.restoreFromSnapshot(SnapshotStoreImpl.java:208) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.lambda$applySnapshot$238(SnapshotStoreImpl.java:278) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:137) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.db.DbStorage.transactionedWrite(DbStorage.java:146) ~[aurora-113.jar:na]
>         at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101) ~[mybatis-guice-3.7.jar:3.7]
>         at org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$203(DbStorage.java:160) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:158) ~[aurora-113.jar:na]
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
>         at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.applySnapshot(SnapshotStoreImpl.java:274) ~[aurora-113.jar:na]
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
>         at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.applySnapshot(SnapshotStoreImpl.java:63) ~[aurora-113.jar:na]
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
> ...
> {noformat}
> We attributed that to fee5943a95c4f08e148dc5f1366486a8c23d5773 and reverted it in https://reviews.apache.org/r/42922/. I have not been able to reproduce it in unit tests yet. Further investigation is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)