Posted to issues@aurora.apache.org by "Aman Thakral (JIRA)" <ji...@apache.org> on 2014/05/14 20:55:18 UTC

[jira] [Issue Comment Deleted] (AURORA-420) scheduler crash due to corrupt replica data?

     [ https://issues.apache.org/jira/browse/AURORA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Thakral updated AURORA-420:
--------------------------------

    Comment: was deleted

(was: I've been seeing similar behavior. The Aurora process seems to crash every 2 or 3 days at seemingly random times. If I reboot the machine, the process works correctly again, but only for a short period of time. My last pull was on May 12 (commit: 90423243977f141002319f9cd4bd59bcee33aefe). I'll post my logs the next time I see this problem.)

> scheduler crash due to corrupt replica data?
> --------------------------------------------
>
>                 Key: AURORA-420
>                 URL: https://issues.apache.org/jira/browse/AURORA-420
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.6.0
>            Reporter: Bhuvan Arumugam
>
> We are using the latest as of https://github.com/apache/incubator-aurora/commit/90423243977f141002319f9cd4bd59bcee33aefe. Technically it's a 0.5.1 snapshot.
> The scheduler seems to crash due to corrupt data in the replica. It has crashed twice in the last 2 days. Here is the log snippet.
> The last time we started the scheduler after a similar crash, all jobs were lost. We were running around 30 apps on different slaves during the crash. The apps are still running on the slaves, though, and the slaves are shown as running in the master UI. The scheduler seems to have trouble reconnecting to the running tasks when it comes back online. FWIW, we are not using checkpointing.
> Can you let me know:
>   1. how to prevent the crashes?
>   2. how to recover jobs from the replica backup?
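> (For question 2, a rough sketch of the backup-based recovery flow as documented for later Aurora releases; the exact aurora_admin sub-commands and placeholders below are assumptions for this 0.5.1-snapshot build and should be verified against the client that ships with it.)
> {code}
> # Assumes the scheduler was started with -backup_dir and has been writing
> # periodic backups there; pick the most recent backup on the leading scheduler.
> ls /path/to/backup_dir
>
> # Stage the chosen backup for recovery (command name taken from later Aurora
> # docs; verify it exists in this build's aurora_admin). <cluster> and the
> # backup file name are placeholders.
> aurora_admin scheduler_stage_recovery <cluster> scheduler-backup-<timestamp>
>
> # Commit the staged recovery, then restart the scheduler so it reloads state.
> aurora_admin scheduler_commit_recovery <cluster>
> {code}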
> {code}
> I0513 15:07:39.982774 25560 log.cpp:680] Attempting to append 125 bytes to the log
> I0513 15:07:39.982879 25545 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 29779
> I0513 15:07:39.983695 25543 replica.cpp:508] Replica received write request for position 29779
> I0513 15:07:39.986923 25543 leveldb.cpp:341] Persisting action (144 bytes) to leveldb took 3.177192ms
> I0513 15:07:39.986961 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.987192 25543 replica.cpp:655] Replica received learned notice for position 29779
> I0513 15:07:39.989861 25543 leveldb.cpp:341] Persisting action (146 bytes) to leveldb took 2.637372ms
> I0513 15:07:39.989895 25543 replica.cpp:676] Persisted action at 29779
> I0513 15:07:39.989907 25543 replica.cpp:661] Replica learned APPEND action at position 29779
> I0513 22:07:46.621 THREAD5299 org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer: Returning offers for 20140512-151150-360689681-5050-7152-6 for compaction.
> I0513 22:08:39.641 THREAD5301 org.apache.aurora.scheduler.async.OfferQueue$OfferQueueImpl.addOffer: Returning offers for 20140512-151150-360689681-5050-7152-9 for compaction.
> I0513 22:10:20.474 THREAD29 org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run: Triggering automatic failover.
> I0513 22:10:20.475 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition ACTIVE -> DEAD
> I0513 15:10:20.486500 25562 sched.cpp:731] Stopping framework '2014-03-26-13:02:35-360689681-5050-31080-0000'
> I0513 22:10:20.486 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: storage state machine transition READY -> STOPPED
> W0513 22:10:20.486 THREAD24 com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange: server set empty for path /aurora/scheduler
> I0513 22:10:20.486 THREAD31 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition DEAD -> DEAD
> I0513 22:10:20.486 THREAD29 com.twitter.common.application.Lifecycle.shutdown: Shutting down application
> I0513 22:10:20.487 THREAD31 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already invoked, ignoring extra call.
> W0513 22:10:20.486 THREAD24 org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: No schedulers in host set, will not redirect despite not being leader.
> I0513 22:10:20.487 THREAD29 com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute: Executing 8 shutdown commands.
> W0513 22:10:20.488 THREAD24 com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: All candidates have temporarily left the group: Group /aurora/scheduler
> E0513 22:10:20.488 THREAD24 org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onDefeated: Lost leadership, committing suicide.
> I0513 22:10:20.489 THREAD24 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition DEAD -> DEAD
> I0513 22:10:20.489 THREAD24 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already invoked, ignoring extra call.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute: Shutdown initiated by: Thread: Lifecycle-0 (id 29)
> java.lang.Thread.getStackTrace(Thread.java:1588)
>   org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:151)
>   com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
>   com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
>   org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:382)
>   org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:354)
>   com.twitter.common.base.Closures$4.execute(Closures.java:120)
>   com.twitter.common.base.Closures$3.execute(Closures.java:98)
>   com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
>   org.apache.aurora.scheduler.SchedulerLifecycle$6$4.run(SchedulerLifecycle.java:287)
>   java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   java.lang.Thread.run(Thread.java:744)
> I0513 22:10:20.491 THREAD29 com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute: Variable sampler shut down
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServerLauncher$1.execute: Stopping thrift server.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.shutdown: Received shutdown request, stopping server.
> I0513 22:10:20.491 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status ALIVE to STOPPING
> I0513 22:10:20.492 THREAD29 org.apache.aurora.scheduler.thrift.ThriftServer.setStatus: Moving from status STOPPING to STOPPED
> I0513 22:10:20.492 THREAD29 com.twitter.common.application.modules.HttpModule$HttpServerLauncher$1.execute: Shutting down embedded http server
> I0513 22:10:20.492 THREAD29 org.mortbay.log.Slf4jLog.info: Stopped SelectChannelConnector@0.0.0.0:8081
> I0513 22:10:20.594 THREAD29 com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle state machine transition DEAD -> DEAD
> I0513 22:10:20.594 THREAD29 org.apache.aurora.scheduler.SchedulerLifecycle$8.execute: Shutdown already invoked, ignoring extra call.
> I0513 22:10:20.595 THREAD1 com.twitter.common.application.AppLauncher.run: Application run() exited.
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)