Posted to reviews@aurora.apache.org by GitBox <gi...@apache.org> on 2018/08/16 22:03:26 UTC

[GitHub] jordanly opened a new issue #32: Generic exceptions within storage.write statements are not caught potentially causing inconsistent state

URL: https://github.com/apache/aurora/issues/32
 
 
   A finding from https://github.com/apache/aurora/issues/31.
   
   A user created an update to remove instances from a job. This throws a NullPointerException, as described in the issue above. The [LoggingInterceptor actually swallows the exception](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/aop/LoggingInterceptor.java#L102-L107). The exception reaches the interceptor because the initial evaluation of the update runs inside the RPC call made by the user ([follow along the start(...) method if you are not convinced](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java#L185)).
   
   **Although the above start command throws a NullPointerException, the update is still added to the MemJobUpdateStore but is not persisted to the log.** We still call [saveJobUpdate(...)](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java#L221) within the `start(...)` code, which adds the update to the in-memory stores. However, because the NullPointerException is thrown before the write lock is exited, these operations are [never persisted to the log](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/durability/DurableStorage.java#L201-L210). The storage system in the scheduler is designed to be transactional, so everything is added to the log at the end of the write. As a result, we are now in a state where the memory store does not match the log store.
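
   To make the failure mode concrete, here is a minimal, self-contained sketch. It is not Aurora's code: `TransactionalStoreSketch`, `Work`, `save` and the two lists are hypothetical stand-ins for the in-memory store and the log, and the `catch` in `main` only imitates an interceptor that swallows the exception.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a transactional store; not Aurora's DurableStorage.
public class TransactionalStoreSketch {
    // Unit of work executed inside a write "transaction".
    public interface Work {
        void apply(TransactionalStoreSketch store);
    }

    public final List<String> memoryStore = new ArrayList<>(); // mutated immediately
    public final List<String> durableLog = new ArrayList<>();  // appended only at the end
    private final List<String> pending = new ArrayList<>();

    // Applies the change to memory right away and records it for the later flush.
    public void save(String update) {
        memoryStore.add(update);
        pending.add(update);
    }

    // Runs the work, then flushes the recorded operations to the log.
    // If the work throws, the flush is skipped but memory stays mutated.
    public void write(Work work) {
        pending.clear();
        work.apply(this);
        durableLog.addAll(pending);
        pending.clear();
    }

    public static void main(String[] args) {
        TransactionalStoreSketch store = new TransactionalStoreSketch();
        try {
            store.write(s -> {
                s.save("update-1");
                throw new NullPointerException("simulated failure");
            });
        } catch (NullPointerException e) {
            // Imitates a caller that logs and swallows the exception.
        }
        // prints: memory=[update-1] log=[]
        System.out.println("memory=" + store.memoryStore + " log=" + store.durableLog);
    }
}
```

   After the swallowed exception the process keeps running with `update-1` visible in memory only, which is the mismatch described above.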
   
   I think that we should catch all unhandled exceptions within the write lock and immediately kill the scheduler. This would keep such errors from leaving the scheduler in a potentially inconsistent state and from corrupting the log, which would prevent an easy rollback.
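
   A rough sketch of what that could look like, under stated assumptions: `FailFastWriteSketch` and its `onFatal` hook are hypothetical names, not part of Aurora, and which exception types should count as "unhandled" is left open here (the sketch simply treats every `RuntimeException` as fatal). The hook is injectable so that the halting behavior can be swapped out in tests.

```java
import java.util.function.Consumer;

// Hypothetical fail-fast wrapper around a write; not Aurora's actual storage API.
public class FailFastWriteSketch {
    private final Consumer<Throwable> onFatal;

    public FailFastWriteSketch(Consumer<Throwable> onFatal) {
        this.onFatal = onFatal;
    }

    // Runs the work; any RuntimeException escaping it is reported to the
    // fatal hook before being rethrown. A real implementation would flush
    // the transaction to the log after work.run() succeeds.
    public void write(Runnable work) {
        try {
            work.run();
        } catch (RuntimeException e) {
            onFatal.accept(e);
            throw e; // only reached if the hook did not stop the process
        }
    }

    public static void main(String[] args) {
        FailFastWriteSketch storage = new FailFastWriteSketch(t -> {
            System.err.println("Unhandled exception inside write, halting: " + t);
            // Stops the JVM immediately, without running shutdown hooks.
            Runtime.getRuntime().halt(1);
        });
        storage.write(() -> {
            throw new NullPointerException("simulated failure");
        });
    }
}
```

   Halting before anything else can observe the half-applied write means that, on restart, the in-memory state is rebuilt from the log alone, so the two cannot diverge.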
