Posted to dev@aurora.apache.org by "Bill Farner (JIRA)" <ji...@apache.org> on 2014/01/17 17:54:19 UTC
[jira] [Commented] (AURORA-51) Scheduler stalls during startup if storage recovery fails

    [ https://issues.apache.org/jira/browse/AURORA-51?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874944#comment-13874944 ] 

Bill Farner commented on AURORA-51:
-----------------------------------

More detail on the regression. This is how the code looked before {{ad999b9}}:
{noformat}
      try {
        lead();
        control.advertise();
      } catch (Group.JoinException e) {
        LOG.log(Level.SEVERE, "Failed to advertise leader, shutting down.", e);
        lifecycle.shutdown();
      } catch (InterruptedException e) {
        LOG.log(Level.SEVERE, "Failed to update endpoint status, shutting down.", e);
        lifecycle.shutdown();
        Thread.currentThread().interrupt();
      } catch (RuntimeException e) {
        LOG.log(Level.SEVERE, "Unexpected exception attempting to lead, shutting down.", e);
        lifecycle.shutdown();
      }
{noformat}
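
For illustration only, here is a minimal sketch of the broader idea from the issue description: shut down on any unchecked exception raised while transitioning. {{FakeLifecycle}} and {{GuardedTransition}} are hypothetical names invented for this sketch, not classes in Aurora, and the actual fix may look quite different.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the scheduler's lifecycle object; not Aurora's actual class. */
final class FakeLifecycle {
  private final AtomicBoolean shutDown = new AtomicBoolean(false);

  /** Records that a shutdown was requested. */
  void shutdown() {
    shutDown.set(true);
  }

  boolean isShutDown() {
    return shutDown.get();
  }
}

/** Hypothetical helper that runs a transition and requests shutdown on any unchecked failure. */
final class GuardedTransition {
  private GuardedTransition() {
    // Utility class, not instantiable.
  }

  /**
   * Runs the action. Returns true if it completed normally; otherwise logs the
   * failure, requests shutdown, and returns false instead of leaving a stalled leader.
   */
  static boolean run(FakeLifecycle lifecycle, Runnable action) {
    try {
      action.run();
      return true;
    } catch (RuntimeException e) {
      // Assumption: failures such as the storage recovery error surface as unchecked exceptions.
      System.err.println("Unexpected exception during transition, shutting down: " + e);
      lifecycle.shutdown();
      return false;
    }
  }
}
```

Under these assumptions, a call such as {{GuardedTransition.run(lifecycle, () -> storage.start())}} would request shutdown when storage recovery throws, rather than leaving the exception to the ZooKeeper event thread. Here {{storage}} is likewise a placeholder.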

> Scheduler stalls during startup if storage recovery fails
> ---------------------------------------------------------
>
>                 Key: AURORA-51
>                 URL: https://issues.apache.org/jira/browse/AURORA-51
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Bill Farner
>            Assignee: Bill Farner
>            Priority: Critical
>
> If SchedulerLifecycle encounters a RuntimeException while initializing storage, it takes no action to abort.  The result is a leader in ZK that will never make progress and requires human intervention (killing the process).
> It would be prudent to consider a sweeping improvement in the course of fixing this, such as initiating a shutdown on any uncaught exception when transitioning in SchedulerLifecycle.
> {noformat}
> E0117 09:04:17.426 THREAD21 org.apache.zookeeper.ClientCnxn$EventThread.processEvent: Error while calling watcher
> org.apache.aurora.scheduler.storage.log.LogStorage$RecoveryFailedException: org.apache.aurora.scheduler.log.Log$Stream$StreamAccessException: Problem reading from log
>         at org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:329)
>         at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
>         at org.apache.aurora.scheduler.storage.log.LogStorage$2.execute(LogStorage.java:303)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:138)
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:155)
>         at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:146)
>         at com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
>         at org.apache.aurora.scheduler.storage.ForwardingStore.write(ForwardingStore.java:105)
>         at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:475)
>         at org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:298)
>         at org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:94)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$5.execute(SchedulerLifecycle.java:240)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$5.execute(SchedulerLifecycle.java:237)
>         at com.twitter.common.base.Closures$4.execute(Closures.java:120)
>         at com.twitter.common.base.Closures$4.execute(Closures.java:120)
>         at com.twitter.common.base.Closures$3.execute(Closures.java:98)
>         at com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>         at org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:446)
>         at com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
>         at com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
>         at com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
>         at com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
>         at com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
>         at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
>         at com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
>         at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
>         at com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
>         at com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
>         at com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
>         at com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
>         at com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
>         at com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> Caused by: org.apache.aurora.scheduler.log.Log$Stream$StreamAccessException: Problem reading from log
>         at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream$2.hasNext(MesosLog.java:255)
>         at org.apache.aurora.scheduler.storage.log.LogManager$StreamManager.readFromBeginning(LogManager.java:190)
>         at org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:323)
>         ... 33 more
> Caused by: org.apache.mesos.Log$OperationFailedException: Bad read range (includes pending entries)
>         at org.apache.mesos.Log$Reader.read(Native Method)
>         at org.apache.aurora.scheduler.log.mesos.MesosLogStreamModule$4.read(MesosLogStreamModule.java:168)
>         at org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream$2.hasNext(MesosLog.java:233)
>         ... 35 more
> {noformat}


