You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Benjamin Bannier <be...@mesosphere.io> on 2018/05/02 14:35:34 UTC

Review Request 66908: Correctly reconciled dropped operation after agent failover.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66908/
-----------------------------------------------------------

Review request for mesos, Gaston Kleiman and Greg Mann.


Bugs: MESOS-8870
    https://issues.apache.org/jira/browse/MESOS-8870


Repository: mesos


Description
-------

When the master receives an `UpdateSlaveMessage` after agent failover
it previously did not correctly detect dropped operations (operations
known to the master, but unknown to the agent) and did not trigger
reconciliation for such operations.

This patch fixes the handler in the master so that such dropped
operations are reconciled.


Diffs
-----

  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
  src/tests/master_slave_reconciliation_tests.cpp 6bb4263323bcfd191c8e3c1ccba10a240e9ddd83 


Diff: https://reviews.apache.org/r/66908/diff/1/


Testing
-------

`make check`

The test in this patch fails without the corresponding master change, but succeeds with it applied.


Thanks,

Benjamin Bannier


Re: Review Request 66908: Correctly reconciled dropped operation after agent failover.

Posted by Gaston Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66908/#review202323
-----------------------------------------------------------


Fix it, then Ship it!





src/tests/master_slave_reconciliation_tests.cpp
Lines 371-372 (patched)
<https://reviews.apache.org/r/66908/#comment284084>

    Nit: many of the tests we already have do this instead:
    
    ```
      FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO;
      frameworkInfo.set_roles(0, DEFAULT_TEST_ROLE);
    ```


- Gaston Kleiman


On May 2, 2018, 7:35 a.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66908/
> -----------------------------------------------------------
> 
> (Updated May 2, 2018, 7:35 a.m.)
> 
> 
> Review request for mesos, Gaston Kleiman and Greg Mann.
> 
> 
> Bugs: MESOS-8870
>     https://issues.apache.org/jira/browse/MESOS-8870
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the master receives an `UpdateSlaveMessage` after agent failover
> it previously did not correctly detect dropped operations (operations
> known to the master, but unknown to the agent) and did not trigger
> reconciliation for such operations.
> 
> This patch fixes the handler in the master so that such dropped
> operations are reconciled.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
>   src/tests/master_slave_reconciliation_tests.cpp 6bb4263323bcfd191c8e3c1ccba10a240e9ddd83 
> 
> 
> Diff: https://reviews.apache.org/r/66908/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> The test in this patch fails without the corresponding master change, but succeeds with it applied.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Re: Review Request 66908: Correctly reconciled dropped operation after agent failover.

Posted by Benjamin Bannier <be...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66908/
-----------------------------------------------------------

(Updated May 4, 2018, 12:37 p.m.)


Review request for mesos, Gaston Kleiman and Greg Mann.


Changes
-------

Addressed comments.


Bugs: MESOS-8870
    https://issues.apache.org/jira/browse/MESOS-8870


Repository: mesos


Description
-------

When the master receives an `UpdateSlaveMessage` after agent failover
it previously did not correctly detect dropped operations (operations
known to the master, but unknown to the agent) and did not trigger
reconciliation for such operations.

This patch fixes the handler in the master so that such dropped
operations are reconciled.


Diffs (updated)
-----

  src/master/master.cpp 7a2f69c1fe2508e16c2685cd3490d5b1f59d6ac4 
  src/tests/master_slave_reconciliation_tests.cpp 6bb4263323bcfd191c8e3c1ccba10a240e9ddd83 


Diff: https://reviews.apache.org/r/66908/diff/2/

Changes: https://reviews.apache.org/r/66908/diff/1-2/


Testing
-------

`make check`

The test in this patch fails without the corresponding master change, but succeeds with it applied.


Thanks,

Benjamin Bannier


Re: Review Request 66908: Correctly reconciled dropped operation after agent failover.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66908/#review202297
-----------------------------------------------------------



Patch looks great!

Reviews applied: [66908]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On May 2, 2018, 2:35 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66908/
> -----------------------------------------------------------
> 
> (Updated May 2, 2018, 2:35 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman and Greg Mann.
> 
> 
> Bugs: MESOS-8870
>     https://issues.apache.org/jira/browse/MESOS-8870
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the master receives an `UpdateSlaveMessage` after agent failover
> it previously did not correctly detect dropped operations (operations
> known to the master, but unknown to the agent) and did not trigger
> reconciliation for such operations.
> 
> This patch fixes the handler in the master so that such dropped
> operations are reconciled.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
>   src/tests/master_slave_reconciliation_tests.cpp 6bb4263323bcfd191c8e3c1ccba10a240e9ddd83 
> 
> 
> Diff: https://reviews.apache.org/r/66908/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> The test in this patch fails without the corresponding master change, but succeeds with it applied.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Re: Review Request 66908: Correctly reconciled dropped operation after agent failover.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66908/#review202281
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['66908']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66908

Relevant logs:

- [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66908/logs/mesos-tests-stdout.log):

```
[----------] 9 tests from Endpoint/SlaveEndpointTest (1119 ms total)

[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest
[ RUN      ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0
[       OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0 (33 ms)
[ RUN      ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1
[       OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1 (36 ms)
[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest (72 ms total)

[----------] 1 test from IsolationFlag/CpuIsolatorTest
[ RUN      ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
[       OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (755 ms)
[----------] 1 test from IsolationFlag/CpuIsolatorTest (778 ms total)

[----------] 1 test from IsolationFlag/MemoryIsolatorTest
[ RUN      ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (728 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (753 ms total)

[----------] Global test environment tear-down
[==========] 971 tests from 95 test cases ran. (465265 ms total)
[  PASSED  ] 969 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] ContentType/ResourceProviderManagerHttpApiTest.ResubscribeResourceProvider/0, where GetParam() = application/x-protobuf
[  FAILED  ] ContentType/ResourceProviderManagerHttpApiTest.ResubscribeResourceProvider/1, where GetParam() = application/json

 2 FAILED TESTS
  YOU HAVE 216 DISABLED TESTS

```

- [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66908/logs/mesos-tests-stderr.log):

```
I0502 15:36:04.607761 13960 master.cpp:10560] Updating the state of task c69b57f4-e357-406d-be90-501e84fffd40 of framework 027eaccf-55d8-4661-8038-7b80a2ff320e-0000 (latest state: TASK_KILLED, status update state: TASK_KILLED)
I0502 15:36:04.607761  5708 slave.cpp:3935] Shutting down framework 027eaccf-55d8-4661I0502 15:36:04.439761  7872 exec.cpp:162] Version: 1.6.0
I0502 15:36:04.466797 12484 exec.cpp:236] Executor registered on agent 027eaccf-55d8-4661-8038-7b80a2ff320e-S0
I0502 15:36:04.471794 13244 executor.cpp:177] Received SUBSCRIBED event
I0502 15:36:04.475795 13244 executor.cpp:181] Subscribed executor on winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net
I0502 15:36:04.476796 13244 executor.cpp:177] Received LAUNCH event
I0502 15:36:04.481798 13244 executor.cpp:649] Starting task c69b57f4-e357-406d-be90-501e84fffd40
I0502 15:36:04.567798 13244 executor.cpp:484] Running 'D:\DCOS\mesos\src\mesos-containerizer.exe launch <POSSIBLY-SENSITIVE-DATA>'
I0502 15:36:04.581775 13244 executor.cpp:662] Forked command at 11528
I0502 15:36:04.610782 10228 exec.cpp:445] Executor asked to shutdown
I0502 15:36:04.610782  9188 executor.cpp:177] Received SHUTDOWN event
I0502 15:36:04.610782  9188 executor.cpp:759] Shutting down
I0502 15:36:04.610782  9188 executor.cpp:869] Sending SIGTERM to process tree at pid -8038-7b80a2ff320e-0000
I0502 15:36:04.608811  5708 slave.cpp:6644] Shutting down executor 'c69b57f4-e357-406d-be90-501e84fffd40' of framework 027eaccf-55d8-4661-8038-7b80a2ff320e-0000 at executor(1)@10.3.1.8:52233
I0502 15:36:04.610782  5708 slave.cpp:929] Agent terminating
I0502 15:36:04.610782 13960 master.cpp:10659] Removing task c69b57f4-e357-406d-be90-501e84fffd40 with resources cpus(allocated: *):4; mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: *):[31000-32000] of framework 027eaccf-55d8-4661-8038-7b80a2ff320e-0000 on agent 027eaccf-55d8-4661-8038-7b80a2ff320e-S0 at slave(439)@10.3.1.8:52212 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
W0502 15:36:04.610782  5708 slave.cpp:3931] Ignoring shutdown framework 027eaccf-55d8-4661-8038-7b80a2ff320e-0000 because it is terminating
I0502 15:36:04.612776 12716 master.cpp:1296] Agent 027eaccf-55d8-4661-8038-7b80a2ff320e-S0 at slave(439)@10.3.1.8:52212 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net) disconnected
I0502 15:36:04.613785 12716 master.cpp:3296] Disconnecting agent 027eaccf-55d8-4661-8038-7b80a2ff320e-S0 at slave(439)@10.3.1.8:52212 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0502 15:36:04.613785 12716 master.cpp:3315] Deactivating agent 027eaccf-55d8-4661-8038-7b80a2ff320e-S0 at slave(439)@10.3.1.8:52212 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0502 15:36:04.615784 12648 hierarchical.cpp:344] Removed framework 027eaccf-55d8-4661-8038-7b80a2ff320e-0000
I0502 15:36:04.615784  5708 containerizer.cpp:2378] Destroying container c928baac-63f6-4190-a663-59f6a42ce7ab in RUNNING state
I0502 15:36:04.615784  5708 containerizer.cpp:2992] Transitioning the state of container c928baac-63f6-4190-a663-59f6a42ce7ab from RUNNING to DESTROYING
I0502 15:36:04.615784 12648 hierarchical.cpp:766] Agent 027eaccf-55d8-4661-8038-7b80a2ff320e-S0 deactivated
I0502 15:36:04.616801  5708 launcher.cpp:156] Asked to destroy container c928baac-63f6-4190-a663-59f6a42ce7ab
I0502 15:36:04.647410 12716 containerizer.cpp:2831] Container c928baac-63f6-4190-a663-59f6a42ce7ab has exited
I0502 15:36:04.679446 13504 master.cpp:1138] Master terminating
I0502 15:36:04.681452  8204 hierarchical.cpp:609] Removed agent 027eaccf-55d8-4661-8038-7b80a2ff320e-S0
I0502 15:36:05.073436  5492 process.cpp:940] Stopped the socket accept loop
```

- Mesos Reviewbot Windows


On May 2, 2018, 2:35 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66908/
> -----------------------------------------------------------
> 
> (Updated May 2, 2018, 2:35 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman and Greg Mann.
> 
> 
> Bugs: MESOS-8870
>     https://issues.apache.org/jira/browse/MESOS-8870
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the master receives an `UpdateSlaveMessage` after agent failover
> it previously did not correctly detect dropped operations (operations
> known to the master, but unknown to the agent) and did not trigger
> reconciliation for such operations.
> 
> This patch fixes the handler in the master so that such dropped
> operations are reconciled.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
>   src/tests/master_slave_reconciliation_tests.cpp 6bb4263323bcfd191c8e3c1ccba10a240e9ddd83 
> 
> 
> Diff: https://reviews.apache.org/r/66908/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> The test in this patch fails without the corresponding master change, but succeeds with it applied.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Re: Review Request 66908: Correctly reconciled dropped operation after agent failover.

Posted by Benjamin Bannier <be...@mesosphere.io>.

> On May 3, 2018, 7 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Lines 7569 (patched)
> > <https://reviews.apache.org/r/66908/diff/1/?file=2015914#file2015914line7569>
> >
> >     I think the following is appropriate?
> >     
> >     s/after a master failover/after a master or agent failover/

I think not? This paragraph describes two different scenarios (master failover or agent failover), and the two parts are explained separately.

How about the following -- note the different punctation,

    // Below we loop over all received operations and check whether
    // they are known to the master; operations can be unknown to the
    // master after a master failover. To handle dropped operations on
    // agent failover we explicitly track the received operations and
    // compare them against the operations known to the master.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66908/#review202361
-----------------------------------------------------------


On May 4, 2018, 12:37 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66908/
> -----------------------------------------------------------
> 
> (Updated May 4, 2018, 12:37 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman and Greg Mann.
> 
> 
> Bugs: MESOS-8870
>     https://issues.apache.org/jira/browse/MESOS-8870
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the master receives an `UpdateSlaveMessage` after agent failover
> it previously did not correctly detect dropped operations (operations
> known to the master, but unknown to the agent) and did not trigger
> reconciliation for such operations.
> 
> This patch fixes the handler in the master so that such dropped
> operations are reconciled.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 7a2f69c1fe2508e16c2685cd3490d5b1f59d6ac4 
>   src/tests/master_slave_reconciliation_tests.cpp 6bb4263323bcfd191c8e3c1ccba10a240e9ddd83 
> 
> 
> Diff: https://reviews.apache.org/r/66908/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> The test in this patch fails without the corresponding master change, but succeeds with it applied.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Re: Review Request 66908: Correctly reconciled dropped operation after agent failover.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66908/#review202361
-----------------------------------------------------------


Fix it, then Ship it!




LGTM! Benjamin, do you think you could update this during your day tomorrow so that we can land it before the 1.6 RC cut?


src/master/master.cpp
Lines 7569 (patched)
<https://reviews.apache.org/r/66908/#comment284149>

    I think the following is appropriate?
    
    s/after a master failover/after a master or agent failover/


- Greg Mann


On May 2, 2018, 2:35 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66908/
> -----------------------------------------------------------
> 
> (Updated May 2, 2018, 2:35 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman and Greg Mann.
> 
> 
> Bugs: MESOS-8870
>     https://issues.apache.org/jira/browse/MESOS-8870
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> When the master receives an `UpdateSlaveMessage` after agent failover
> it previously did not correctly detect dropped operations (operations
> known to the master, but unknown to the agent) and did not trigger
> reconciliation for such operations.
> 
> This patch fixes the handler in the master so that such dropped
> operations are reconciled.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
>   src/tests/master_slave_reconciliation_tests.cpp 6bb4263323bcfd191c8e3c1ccba10a240e9ddd83 
> 
> 
> Diff: https://reviews.apache.org/r/66908/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> The test in this patch fails without the corresponding master change, but succeeds with it applied.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>