You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@aurora.apache.org by Mohit Jaggi <mo...@uber.com> on 2017/09/28 20:56:04 UTC

aurora crash in PendingTaskProcessor

Folks,

I saw the following crash in my scheduler. It appears to be due to
duplicates offers. Any insights appreciated!

Mohit.

*Code:*

https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145

*Logs:*

Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 6:09:00
PM com.google.common.util.concurrent.ServiceManager$ServiceListener failed


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service
PreemptorService [FAILED] has failed in the RUNNING state.


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]:
java.lang.IllegalArgumentException: Multiple entries with same key:
1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value:
"f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }


...

...


ep 28 18:09:00 machine1163 aurora-scheduler[14266]: ,
hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal,
attributes=[IAttribute{name=host, values=[compute606-dca1]},
IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]},
IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE,
slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple
values under a key, use Multimaps.index.


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:146)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:562)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:135)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:161)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:188)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
com.google.common.util.concurrent.Callables$4.run(Callables.java:122)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
java.lang.Thread.run(Thread.java:748)


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316
[PreemptorService RUNNING, GuavaUtils$LifecycleShutdownListener:55]
Service: PreemptorService [FAILED] failed unexpectedly. Triggering shutdown.


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
[qtp1000734462-3068369, Slf4jRequestLog:60] 10.187.28.19 - -
[28/Sep/2017:18:09:00 +0000] "POST //10.188.43.6:8082/api HTTP/1.1" 200 95


Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
[PreemptorService RUNNING, Lifecycle:84] Shutting down application



Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
[PreemptorService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77]
Executing 4 shutdown commands.

Re: aurora crash in PendingTaskProcessor

Posted by Mohit Jaggi <mo...@uber.com>.
Thanks Bill. I will take you up on that offer. I filed AURORA-1952
<https://issues.apache.org/jira/browse/AURORA-1952>. Do I need some
privileges to assign it to myself?

On Fri, Sep 29, 2017 at 10:09 AM, Bill Farner <wf...@apache.org> wrote:

> concurrent map
>
>
> I'm looking at this chunk here
> <https://github.com/apache/aurora/blob/7a803730c95fc7d1f788292d83c3d2eeb81a936d/src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java#L193-L202>,
> where a concurrent map would not help.
>
>       Optional<HostOffer> sameSlave = hostOffers.get(offer.getOffer(
> ).getAgentId());
>       if (sameSlave.isPresent()) {
>         // If there are existing offers for the slave, decline all of them
> so the master can
>         // compact all of those offers into a single offer and send them
> back.
>         LOG.info("Returning offers for " + offer.getOffer().getAgentId().
> getValue()
>             + " for compaction.");
>         decline(offer.getOffer().getId());
>         removeAndDecline(sameSlave.get().getOffer().getId());
>       } else {
>         hostOffers.add(offer);
>
> This exhibits a classic check-then-act race on hostOffers, which could
> allow a second offer with the same agent ID.  An obvious fix here would be
> to move the "if exists, remove, else add" sequence in a synchronized method
> in hostOffers.
>
> Happy to help guide you on a patch!
>
>
>
> On Fri, Sep 29, 2017 at 9:57 AM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> Will do. If the fix involves making the map of offers by agent id a
>> concurrent map...I can contribute that.
>>
>> On Fri, Sep 29, 2017 at 9:09 AM, Bill Farner <wf...@apache.org> wrote:
>>
>>> This is due to multiple offers for the same agent, rather than duplicate
>>> offers.  I don't see a specific bug in the suspect code
>>> (OfferManager.java), but it does stand out as subject to races.
>>> Specifically, there is a lack of synchronization when checking for an offer
>>> exists for a given agent ID and subsequently removing that offer.
>>>
>>> Can you file a bug?
>>>
>>> On Thu, Sep 28, 2017 at 1:56 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> Folks,
>>>>
>>>> I saw the following crash in my scheduler. It appears to be due to
>>>> duplicates offers. Any insights appreciated!
>>>>
>>>> Mohit.
>>>>
>>>> *Code:*
>>>>
>>>> https://github.com/apache/aurora/blob/master/src/main/java/o
>>>> rg/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
>>>>
>>>> *Logs:*
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017
>>>> 6:09:00 PM com.google.common.util.concurrent.ServiceManager$ServiceListener
>>>> failed
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service
>>>> PreemptorService [FAILED] has failed in the RUNNING state.
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]:
>>>> java.lang.IllegalArgumentException: Multiple entries with same key:
>>>> 1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value:
>>>> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
>>>>
>>>>
>>>> ...
>>>>
>>>> ...
>>>>
>>>>
>>>> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: ,
>>>> hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal,
>>>> attributes=[IAttribute{name=host, values=[compute606-dca1]},
>>>> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]},
>>>> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE,
>>>> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple
>>>> values under a key, use Multimaps.index.
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.l
>>>> ambda$run$0(PendingTaskProcessor.java:146)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStor
>>>> age.java:147)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.mybatis.guice.transactional.TransactionalMethodIntercept
>>>> or.invoke(TransactionalMethodInterceptor.java:101)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.apache.aurora.common.inject.TimedInterceptor.invoke(Time
>>>> dInterceptor.java:83)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogS
>>>> torage.java:562)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorag
>>>> e.read(CallOrderEnforcingStorage.java:113)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.r
>>>> un(PendingTaskProcessor.java:135)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.apache.aurora.common.inject.TimedInterceptor.invoke(Time
>>>> dInterceptor.java:83)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> org.apache.aurora.scheduler.preemptor.PreemptorModule$Preemp
>>>> torService.runOneIteration(PreemptorModule.java:161)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> com.google.common.util.concurrent.AbstractScheduledService$S
>>>> erviceDelegate$Task.run(AbstractScheduledService.java:188)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>>> tureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>>> tureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>>> Executor.java:1142)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>>>> lExecutor.java:617)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>>> java.lang.Thread.run(Thread.java:748)
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316
>>>> [PreemptorService RUNNING, GuavaUtils$LifecycleShutdownListener:55]
>>>> Service: PreemptorService [FAILED] failed unexpectedly. Triggering shutdown.
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>>>> [qtp1000734462-3068369, Slf4jRequestLog:60] 10.187.28.19 - -
>>>> [28/Sep/2017:18:09:00 +0000] "POST //10.188.43.6:8082/api HTTP/1.1"
>>>> 200 95
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>>>> [PreemptorService RUNNING, Lifecycle:84] Shutting down application
>>>>
>>>>
>>>>
>>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>>>> [PreemptorService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77]
>>>> Executing 4 shutdown commands.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: aurora crash in PendingTaskProcessor

Posted by Bill Farner <wf...@apache.org>.
>
> concurrent map


I'm looking at this chunk here
<https://github.com/apache/aurora/blob/7a803730c95fc7d1f788292d83c3d2eeb81a936d/src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java#L193-L202>,
where a concurrent map would not help.

      Optional<HostOffer> sameSlave =
hostOffers.get(offer.getOffer().getAgentId());
      if (sameSlave.isPresent()) {
        // If there are existing offers for the slave, decline all of them
so the master can
        // compact all of those offers into a single offer and send them
back.
        LOG.info("Returning offers for " +
offer.getOffer().getAgentId().getValue()
            + " for compaction.");
        decline(offer.getOffer().getId());
        removeAndDecline(sameSlave.get().getOffer().getId());
      } else {
        hostOffers.add(offer);

This exhibits a classic check-then-act race on hostOffers, which could
allow a second offer with the same agent ID.  An obvious fix here would be
to move the "if exists, remove, else add" sequence in a synchronized method
in hostOffers.

Happy to help guide you on a patch!



On Fri, Sep 29, 2017 at 9:57 AM, Mohit Jaggi <mo...@uber.com> wrote:

> Will do. If the fix involves making the map of offers by agent id a
> concurrent map...I can contribute that.
>
> On Fri, Sep 29, 2017 at 9:09 AM, Bill Farner <wf...@apache.org> wrote:
>
>> This is due to multiple offers for the same agent, rather than duplicate
>> offers.  I don't see a specific bug in the suspect code
>> (OfferManager.java), but it does stand out as subject to races.
>> Specifically, there is a lack of synchronization when checking for an offer
>> exists for a given agent ID and subsequently removing that offer.
>>
>> Can you file a bug?
>>
>> On Thu, Sep 28, 2017 at 1:56 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Folks,
>>>
>>> I saw the following crash in my scheduler. It appears to be due to
>>> duplicates offers. Any insights appreciated!
>>>
>>> Mohit.
>>>
>>> *Code:*
>>>
>>> https://github.com/apache/aurora/blob/master/src/main/java/o
>>> rg/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
>>>
>>> *Logs:*
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017
>>> 6:09:00 PM com.google.common.util.concurrent.ServiceManager$ServiceListener
>>> failed
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service
>>> PreemptorService [FAILED] has failed in the RUNNING state.
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]:
>>> java.lang.IllegalArgumentException: Multiple entries with same key:
>>> 1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value:
>>> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
>>>
>>>
>>> ...
>>>
>>> ...
>>>
>>>
>>> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: ,
>>> hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal,
>>> attributes=[IAttribute{name=host, values=[compute606-dca1]},
>>> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]},
>>> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE,
>>> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple
>>> values under a key, use Multimaps.index.
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.l
>>> ambda$run$0(PendingTaskProcessor.java:146)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStor
>>> age.java:147)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.mybatis.guice.transactional.TransactionalMethodIntercept
>>> or.invoke(TransactionalMethodInterceptor.java:101)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.common.inject.TimedInterceptor.invoke(Time
>>> dInterceptor.java:83)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogS
>>> torage.java:562)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorag
>>> e.read(CallOrderEnforcingStorage.java:113)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.r
>>> un(PendingTaskProcessor.java:135)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.common.inject.TimedInterceptor.invoke(Time
>>> dInterceptor.java:83)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.scheduler.preemptor.PreemptorModule$Preemp
>>> torService.runOneIteration(PreemptorModule.java:161)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> com.google.common.util.concurrent.AbstractScheduledService$S
>>> erviceDelegate$Task.run(AbstractScheduledService.java:188)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>> tureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>> tureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>> Executor.java:1142)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>>> lExecutor.java:617)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> java.lang.Thread.run(Thread.java:748)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316
>>> [PreemptorService RUNNING, GuavaUtils$LifecycleShutdownListener:55]
>>> Service: PreemptorService [FAILED] failed unexpectedly. Triggering shutdown.
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>>> [qtp1000734462-3068369, Slf4jRequestLog:60] 10.187.28.19 - -
>>> [28/Sep/2017:18:09:00 +0000] "POST //10.188.43.6:8082/api HTTP/1.1" 200
>>> 95
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>>> [PreemptorService RUNNING, Lifecycle:84] Shutting down application
>>>
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>>> [PreemptorService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77]
>>> Executing 4 shutdown commands.
>>>
>>>
>>>
>>>
>>>
>>
>

Re: aurora crash in PendingTaskProcessor

Posted by Mohit Jaggi <mo...@uber.com>.
Will do. If the fix involves making the map of offers by agent id a
concurrent map...I can contribute that.

On Fri, Sep 29, 2017 at 9:09 AM, Bill Farner <wf...@apache.org> wrote:

> This is due to multiple offers for the same agent, rather than duplicate
> offers.  I don't see a specific bug in the suspect code
> (OfferManager.java), but it does stand out as subject to races.
> Specifically, there is a lack of synchronization when checking for an offer
> exists for a given agent ID and subsequently removing that offer.
>
> Can you file a bug?
>
> On Thu, Sep 28, 2017 at 1:56 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> Folks,
>>
>> I saw the following crash in my scheduler. It appears to be due to
>> duplicates offers. Any insights appreciated!
>>
>> Mohit.
>>
>> *Code:*
>>
>> https://github.com/apache/aurora/blob/master/src/main/java/
>> org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
>>
>> *Logs:*
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 6:09:00
>> PM com.google.common.util.concurrent.ServiceManager$ServiceListener
>> failed
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service
>> PreemptorService [FAILED] has failed in the RUNNING state.
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]:
>> java.lang.IllegalArgumentException: Multiple entries with same key:
>> 1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value:
>> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
>>
>>
>> ...
>>
>> ...
>>
>>
>> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: ,
>> hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal,
>> attributes=[IAttribute{name=host, values=[compute606-dca1]},
>> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]},
>> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE,
>> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple
>> values under a key, use Multimaps.index.
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(
>> PendingTaskProcessor.java:146)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.mybatis.guice.transactional.TransactionalMethodIntercept
>> or.invoke(TransactionalMethodInterceptor.java:101)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.apache.aurora.common.inject.TimedInterceptor.invoke(
>> TimedInterceptor.java:83)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogS
>> torage.java:562)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorag
>> e.read(CallOrderEnforcingStorage.java:113)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.
>> run(PendingTaskProcessor.java:135)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.apache.aurora.common.inject.TimedInterceptor.invoke(
>> TimedInterceptor.java:83)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> org.apache.aurora.scheduler.preemptor.PreemptorModule$Preemp
>> torService.runOneIteration(PreemptorModule.java:161)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> com.google.common.util.concurrent.AbstractScheduledService$S
>> erviceDelegate$Task.run(AbstractScheduledService.java:188)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>> tureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>> tureTask.run(ScheduledThreadPoolExecutor.java:294)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1142)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:617)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>> java.lang.Thread.run(Thread.java:748)
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316
>> [PreemptorService RUNNING, GuavaUtils$LifecycleShutdownListener:55]
>> Service: PreemptorService [FAILED] failed unexpectedly. Triggering shutdown.
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>> [qtp1000734462-3068369, Slf4jRequestLog:60] 10.187.28.19 - -
>> [28/Sep/2017:18:09:00 +0000] "POST //10.188.43.6:8082/api HTTP/1.1" 200
>> 95
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>> [PreemptorService RUNNING, Lifecycle:84] Shutting down application
>>
>>
>>
>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
>> [PreemptorService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77]
>> Executing 4 shutdown commands.
>>
>>
>>
>>
>>
>

Re: aurora crash in PendingTaskProcessor

Posted by Bill Farner <wf...@apache.org>.
This is due to multiple offers for the same agent, rather than duplicate
offers.  I don't see a specific bug in the suspect code
(OfferManager.java), but it does stand out as subject to races.
Specifically, there is a lack of synchronization when checking for an offer
exists for a given agent ID and subsequently removing that offer.

Can you file a bug?

On Thu, Sep 28, 2017 at 1:56 PM, Mohit Jaggi <mo...@uber.com> wrote:

> Folks,
>
> I saw the following crash in my scheduler. It appears to be due to
> duplicates offers. Any insights appreciated!
>
> Mohit.
>
> *Code:*
>
> https://github.com/apache/aurora/blob/master/src/main/
> java/org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
>
> *Logs:*
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 6:09:00
> PM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service
> PreemptorService [FAILED] has failed in the RUNNING state.
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: java.lang.IllegalArgumentException:
> Multiple entries with same key: 1ed038e0-a3ef-4476-adfd-
> 70c86241c5f7-S102=HostOffer{offer=id {
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value:
> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
>
>
> ...
>
> ...
>
>
> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: , hostAttributes=
> IHostAttributes{host=compute606-dca1.prod.uber.internal,
> attributes=[IAttribute{name=host, values=[compute606-dca1]},
> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]},
> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE,
> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple
> values under a key, use Multimaps.index.
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$
> run$0(PendingTaskProcessor.java:146)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at org.mybatis.guice.
> transactional.TransactionalMethodInterceptor.invoke(
> TransactionalMethodInterceptor.java:101)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.common.inject.TimedInterceptor.
> invoke(TimedInterceptor.java:83)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.storage.log.LogStorage.read(
> LogStorage.java:562)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(
> CallOrderEnforcingStorage.java:113)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(
> PendingTaskProcessor.java:135)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.common.inject.TimedInterceptor.
> invoke(TimedInterceptor.java:83)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.
> runOneIteration(PreemptorModule.java:161)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> com.google.common.util.concurrent.AbstractScheduledService$
> ServiceDelegate$Task.run(AbstractScheduledService.java:188)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.ScheduledThreadPoolExecutor$
> ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(
> ScheduledThreadPoolExecutor.java:294)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.lang.Thread.run(Thread.java:748)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316
> [PreemptorService RUNNING, GuavaUtils$LifecycleShutdownListener:55]
> Service: PreemptorService [FAILED] failed unexpectedly. Triggering shutdown.
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
> [qtp1000734462-3068369, Slf4jRequestLog:60] 10.187.28.19 - -
> [28/Sep/2017:18:09:00 +0000] "POST //10.188.43.6:8082/api HTTP/1.1" 200 95
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
> [PreemptorService RUNNING, Lifecycle:84] Shutting down application
>
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316
> [PreemptorService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77]
> Executing 4 shutdown commands.
>
>
>
>
>