Posted to issues@mesos.apache.org by "Guangya Liu (JIRA)" <ji...@apache.org> on 2016/03/02 08:00:23 UTC

[jira] [Commented] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.

    [ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175161#comment-15175161 ] 

Guangya Liu commented on MESOS-4831:
------------------------------------

There is a bug in how the HTTP endpoint that sets the host maintenance schedule is handled: https://github.com/apache/mesos/blob/master/src/master/http.cpp#L1987-L2021

The logic is as follows:
1) Collect every machine from the maintenance windows into the {{updated}} hashmap.
2) If a machine in {{master->machines}} is also in {{updated}}, call the master's {{updateUnavailability}}, which triggers {{recoverResources}}, {{updateUnavailability}}, etc. in the {{allocator}}.
3) Otherwise, clear the unavailability time window for the machine.
4) For each new machine in {{updated}}, call the master's {{updateUnavailability}}.

But step 4) iterates over all machines in the schedule windows, not only the machines that are new to the cluster. As a result, the master calls {{updateUnavailability}} twice for a machine in the {{updated}} hashmap that is already in {{master->machines}}.
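
The double call can be illustrated with a minimal, self-contained sketch. This is not the code in http.cpp: {{ToyMaster}}, {{MachineId}}, {{applySchedule}} and the {{skipKnown}} flag are hypothetical stand-ins invented for this example, and the unavailability is reduced to a single number. It assumes C++17 for {{std::optional}}.

```cpp
#include <map>
#include <optional>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; these are not the Mesos types.
using MachineId = std::string;
using Unavailability = long;  // Only a start time, in seconds.

struct ToyMaster {
  // Known machines and their unavailability (nullopt means none scheduled).
  std::map<MachineId, std::optional<Unavailability>> machines;

  // Number of updateUnavailability calls seen per machine.
  std::map<MachineId, int> calls;

  void updateUnavailability(
      const MachineId& id, std::optional<Unavailability> unavailability) {
    machines[id] = unavailability;
    ++calls[id];
  }
};

// Applies a schedule following steps 2) to 4). When `skipKnown` is false,
// the final loop visits every scheduled machine, as described above; when
// true, it visits only machines the master did not know about beforehand.
void applySchedule(
    ToyMaster& master,
    const std::map<MachineId, Unavailability>& updated,
    bool skipKnown) {
  // Snapshot the keys, because updateUnavailability modifies the map.
  std::set<MachineId> known;
  for (const auto& entry : master.machines) {
    known.insert(entry.first);
  }

  for (const MachineId& id : known) {
    auto it = updated.find(id);
    if (it != updated.end()) {
      master.updateUnavailability(id, it->second);    // Step 2.
    } else {
      master.updateUnavailability(id, std::nullopt);  // Step 3.
    }
  }

  for (const auto& entry : updated) {                 // Step 4.
    if (skipKnown && known.count(entry.first) > 0) {
      continue;  // Already handled in step 2.
    }
    master.updateUnavailability(entry.first, entry.second);
  }
}
```

In this sketch, with {{skipKnown}} set to false, a machine that is already known and also appears in the schedule is counted twice, which mirrors the two {{updateUnavailability}} calls described above. With {{skipKnown}} set to true, each machine is touched by only one of the two loops. Whether the real fix should take this shape is an open question; the sketch only shows where the duplicate comes from.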


> Master sometimes sends two inverse offers after the agent goes into maintenance.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-4831
>                 URL: https://issues.apache.org/jira/browse/MESOS-4831
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.0
>            Reporter: Anand Mazumdar
>            Assignee: Guangya Liu
>              Labels: maintenance, mesosphere
>
> Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}}
> https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull
> {code}
> I0229 11:08:57.027559   668 hierarchical.cpp:1437] No resources available to allocate!
> I0229 11:08:57.027745   668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns
> I0229 11:08:57.027757   675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
> I0229 11:08:57.028586   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
> I0229 11:08:57.029039   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
> {code}
> The ideal expected workflow for this test is something like:
> - The framework receives offers from master.
> - The framework updates its maintenance schedule.
> - The current offer is rescinded.
> - A new offer is received from the master with unavailability set.
> - After the agent goes for maintenance, an inverse offer is sent.
> For some reason, in the logs we see that the master is sending 2 inverse offers. The test seems to pass as we just check for the initial inverse offer being present. This can also be reproduced by a modified version of the original test.
> {code}
> // Test ensures that an offer will have an `unavailability` set if the
> // slave is scheduled to go down for maintenance.
> TEST_F(MasterMaintenanceTest, PendingUnavailabilityTest)
> {
>   Try<PID<Master>> master = StartMaster();
>   ASSERT_SOME(master);
>   MockExecutor exec(DEFAULT_EXECUTOR_ID);
>   Try<PID<Slave>> slave = StartSlave(&exec);
>   ASSERT_SOME(slave);
>   auto scheduler = std::make_shared<MockV1HTTPScheduler>();
>   EXPECT_CALL(*scheduler, heartbeat(_))
>     .WillRepeatedly(Return()); // Ignore heartbeats.
>   Future<Nothing> connected;
>   EXPECT_CALL(*scheduler, connected(_))
>     .WillOnce(FutureSatisfy(&connected))
>     .WillRepeatedly(Return()); // Ignore future invocations.
>   scheduler::TestV1Mesos mesos(master.get(), ContentType::PROTOBUF, scheduler);
>   AWAIT_READY(connected);
>   Future<Event::Subscribed> subscribed;
>   EXPECT_CALL(*scheduler, subscribed(_, _))
>     .WillOnce(FutureArg<1>(&subscribed));
>   Future<Event::Offers> normalOffers;
>   Future<Event::Offers> unavailabilityOffers;
>   Future<Event::Offers> inverseOffers;
>   EXPECT_CALL(*scheduler, offers(_, _))
>     .WillOnce(FutureArg<1>(&normalOffers))
>     .WillOnce(FutureArg<1>(&unavailabilityOffers))
>     .WillOnce(FutureArg<1>(&inverseOffers));
>   // The original offers should be rescinded when the unavailability is changed.
>   Future<Nothing> offerRescinded;
>   EXPECT_CALL(*scheduler, rescind(_, _))
>     .WillOnce(FutureSatisfy(&offerRescinded));
>   {
>     Call call;
>     call.set_type(Call::SUBSCRIBE);
>     Call::Subscribe* subscribe = call.mutable_subscribe();
>     subscribe->mutable_framework_info()->CopyFrom(DEFAULT_V1_FRAMEWORK_INFO);
>     mesos.send(call);
>   }
>   AWAIT_READY(subscribed);
>   v1::FrameworkID frameworkId(subscribed->framework_id());
>   AWAIT_READY(normalOffers);
>   EXPECT_NE(0, normalOffers->offers().size());
>   // Regular offers shouldn't have unavailability.
>   foreach (const v1::Offer& offer, normalOffers->offers()) {
>     EXPECT_FALSE(offer.has_unavailability());
>   }
>   // Schedule this slave for maintenance.
>   MachineID machine;
>   machine.set_hostname(maintenanceHostname);
>   machine.set_ip(stringify(slave.get().address.ip));
>   const Time start = Clock::now() + Seconds(60);
>   const Duration duration = Seconds(120);
>   const Unavailability unavailability = createUnavailability(start, duration);
>   // Post a valid schedule with one machine.
>   maintenance::Schedule schedule = createSchedule(
>       {createWindow({machine}, unavailability)});
>   // We have a few seconds between the first set of offers and the
>   // next allocation of offers. This should be enough time to perform
>   // a maintenance schedule update. This update will also trigger the
>   // rescinding of offers from the scheduled slave.
>   Future<Response> response = process::http::post(
>       master.get(),
>       "maintenance/schedule",
>       headers,
>       stringify(JSON::protobuf(schedule)));
>   AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);
>   // The original offers should be rescinded when the unavailability
>   // is changed.
>   AWAIT_READY(offerRescinded);
>   AWAIT_READY(unavailabilityOffers);
>   EXPECT_NE(0, unavailabilityOffers->offers().size());
>   // Make sure the new offers have the unavailability set.
>   foreach (const v1::Offer& offer, unavailabilityOffers->offers()) {
>     EXPECT_TRUE(offer.has_unavailability());
>     EXPECT_EQ(
>         unavailability.start().nanoseconds(),
>         offer.unavailability().start().nanoseconds());
>     EXPECT_EQ(
>         unavailability.duration().nanoseconds(),
>         offer.unavailability().duration().nanoseconds());
>   }
>   // We also expect an inverse offer for the slave to go under
>   // maintenance.
>   AWAIT_READY(inverseOffers);
>   EXPECT_NE(0, inverseOffers->inverse_offers().size());
>   EXPECT_CALL(exec, shutdown(_))
>     .Times(AtMost(1));
>   EXPECT_CALL(*scheduler, disconnected(_))
>     .Times(AtMost(1));
>   Shutdown(); // Must shutdown before 'containerizer' gets deallocated.
> }
> {code}
> Also, unrelated: we need to clean up this test so that it does not expect multiple offers, i.e. remove the {{numberOfOffers}} constant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)