You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Gastón Kleiman <ga...@mesosphere.io> on 2019/02/13 22:49:14 UTC

Review Request 69978: Added garbage collection of terminated operations status update streams.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/
-----------------------------------------------------------

Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Bugs: MESOS-9574
    https://issues.apache.org/jira/browse/MESOS-9574


Repository: mesos


Description
-------

Make the agent garbage collect an operation stauts update stream once
a terminal status update is acknowledged.

This patch also improves the logging of acknowledgmenet failures and the
readability of the `Slave::operationStatusAcknowledgement` method.


Diffs
-----

  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/69978/diff/1/


Testing
-------

Manual testing + existing tests still pass.


Thanks,

Gastón Kleiman


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review212816
-----------------------------------------------------------



PASS: Mesos patch 69978 was successfully built and tested.

Reviews applied: `['69977', '69978']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2886/mesos-review-69978

- Mesos Reviewbot Windows


On Feb. 13, 2019, 3:15 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 3:15 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.

> On Feb. 15, 2019, 3:50 p.m., Greg Mann wrote:
> > src/slave/slave.cpp
> > Line 4711 (original), 4735 (patched)
> > <https://reviews.apache.org/r/69978/diff/2/?file=2125182#file2125182line4735>
> >
> >     By moving this line into the `.then()` continuation, we will no longer remove the operation if there is an error when calling `operationStatusUpdateManager.acknowledgement()`; is that what we want? Perhaps we should add a `removeOperation()` call in the `err()` helper as well?

I think the patch is good as-is — if we add a removeOperation() call in the err() helper, the framework won't be able to ack the update in the case that there is an error calling operationStatusUpdateManager.acknowledgement() which prevents the SUM from checkpointing the ack and cleaning up the stream, causing it to resend the status update.


That means that the agent would keep resending the update until the stream is garbage collected the next time the agent recovers.


Your comment however made me notice another corner case that the SLRP handles and I had missed: the agent could fail over right before executing this lambda. It should notice that during recovery and garbage collect the stream — I'm adding a new patch to this chain that does this.


- Gastón


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review212881
-----------------------------------------------------------


On Feb. 13, 2019, 3:15 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 3:15 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.

> On Feb. 15, 2019, 3:50 p.m., Greg Mann wrote:
> > src/slave/slave.cpp
> > Line 4711 (original), 4735 (patched)
> > <https://reviews.apache.org/r/69978/diff/2/?file=2125182#file2125182line4735>
> >
> >     By moving this line into the `.then()` continuation, we will no longer remove the operation if there is an error when calling `operationStatusUpdateManager.acknowledgement()`; is that what we want? Perhaps we should add a `removeOperation()` call in the `err()` helper as well?
> 
> Gastón Kleiman wrote:
>     I think the patch is good as-is — if we add a removeOperation() call in the err() helper, the framework won't be able to ack the update in the case that there is an error calling operationStatusUpdateManager.acknowledgement() which prevents the SUM from checkpointing the ack and cleaning up the stream, causing it to resend the status update.
>     
>     
>     That means that the agent would keep resending the update until the stream is garbage collected the next time the agent recovers.
>     
>     
>     
>     Your comment however made me notice another corner case that the SLRP handles and I had missed: the agent could fail over right before executing this lambda. It should notice that during recovery and garbage collect the stream — I'm adding a new patch to this chain that does this.

Note: I ended up squashing the new patch with this one.


- Gastón


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review212881
-----------------------------------------------------------


On Feb. 20, 2019, 4:43 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 4:43 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/3/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review212881
-----------------------------------------------------------




src/slave/slave.cpp
Line 4711 (original), 4735 (patched)
<https://reviews.apache.org/r/69978/#comment298765>

    By moving this line into the `.then()` continuation, we will no longer remove the operation if there is an error when calling `operationStatusUpdateManager.acknowledgement()`; is that what we want? Perhaps we should add a `removeOperation()` call in the `err()` helper as well?


- Greg Mann


On Feb. 13, 2019, 11:15 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 11:15 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review212817
-----------------------------------------------------------



Patch looks great!

Reviews applied: [69977, 69978]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Feb. 13, 2019, 3:15 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 3:15 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review212999
-----------------------------------------------------------




src/slave/slave.cpp
Line 4711 (original), 4735 (patched)
<https://reviews.apache.org/r/69978/#comment298854>

    I think the patch is good as-is — if we add a `removeOperation()` call in the `err()` helper, the framework won't be able to ack the update in the case that there is an error calling `operationStatusUpdateManager.acknowledgement()` which prevents the SUM from checkpointing the ack and cleaning up the stream, causing it to resend the status update.
    
    That means that the agent would keep resending the update until the stream is garbage collected the next time the agent recovers.
    
    Your comment however made me notice another corner case that the SLRP handles and I had missed: the agent could fail over right before executing this lambda. It should notice that during recovery and garbage collect the stream — I'm adding a new patch to this chain that does this.


- Gastón Kleiman


On Feb. 13, 2019, 3:15 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2019, 3:15 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/2/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213074
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['69977', '69978']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2921/mesos-review-69978

Relevant logs:

- [mesos-tests.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2921/mesos-review-69978/logs/mesos-tests.log):

```
I0222 04:04:22.935590 68560 master.cpp:1269] Agent a9c4bb56-3b65-4b85-9d3a-6a8bcd8aebc5-S0 at slave(494)@192.10.1.6:62107 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net) disconnected
I0222 04:04:22.935590 68560 master.cpp:3292] Disconnecting agent a9c4bb56-3b65-4b85-9d3a-6a8bcd8aebc5-S0 at slave(494)@192.10.1.6:62107 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net)
I0222 04:04:22.936599 68560 master.cpp:3311] Deactivating agent a9c4bb56-3b65-4b85-9d3a-6a8bcd8aebc5-S0 at slave(494)@192.10.1.6:62107 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net)
I0222 04:04:22.936599 61696 hierarchical.cpp:390] Removed framework a9c4bb56-3b65-4b85-9d3a-6a8bcd8aebc5-0000
I0222 04:04:22.936599 61696 hierarchical.cpp:827] Agent a9c4bb56-3b65-4b85-9d3a-6a8bcd8aebc5-S0 deactivated
I0222 04:04:22.938586 61696 containerizer.cpp:2526] Destroying container d8552e8b-81d5-4150-88ed-3e6a1e37f81d in RUNNING state
I0222 04:04:22.938586 61696 containerizer.cpp:3193] Transitioning the state of container d8552e8b-81d5-4150-88ed-3e6a1e37f81d from RUNNING to DESTROYING
I0222 04:04:22.942581 61696 launcher.cpp:161] Asked to destroy container d8552e8b-81d5-4150-88ed-3e6a1e37f81d
W0222 04:04:22.943586 66452 process.cpp:1423] Failed to recv on socket WindowsFD::Type::SOCKET=14500 to peer '192.10.1.6:64452': IO failed with error code: The specified network name is no longer available.

W0222 04:04:22.943586 66452 process.cpp:838] Failed to recv on socket WindowsFD::Type::SOCKET=13924 to peer '192.10.1.6:64453': IO failed with error code: The specified network name is no longer available.

I0222 04:04:22.946585 66948 containerizer.cpp:3032] Container d8552e8b-81d5-4150-88ed-3e6a1e37f81d has exited
I0222 04:04:22.977591 67480 master.cpp:1109] Master terminating
I0222 04:04:22.978583 61120 hierarchical.cpp:678[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (683 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (701 ms total)

[----------] Global test environment tear-down
[==========] 1115 tests from 106 test cases ran. (578594 ms total)
[  PASSED  ] 1113 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] DockerFetcherPluginTest.INTERNET_CURL_FetchImage
[  FAILED  ] DockerFetcherPluginTest.INTERNET_CURL_InvokeFetchByName

 2 FAILED TESTS
  YOU HAVE 232 DISABLED TESTS

] Removed agent a9c4bb56-3b65-4b85-9d3a-6a8bcd8aebc5-S0
I0222 04:04:23.892592 66452 process.cpp:927] Stopped the socket accept loop
```

- Mesos Reviewbot Windows


On Feb. 21, 2019, 5:56 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2019, 5:56 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/4/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213105
-----------------------------------------------------------



PASS: Mesos patch 69978 was successfully built and tested.

Reviews applied: `['69977', '69978']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2923/mesos-review-69978

- Mesos Reviewbot Windows


On Feb. 22, 2019, 5:28 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 22, 2019, 5:28 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/5/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213204
-----------------------------------------------------------


Ship it!




Ship It!

- Greg Mann


On Feb. 23, 2019, 1:28 a.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 23, 2019, 1:28 a.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/5/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.

> On Feb. 26, 2019, 3:16 p.m., Joseph Wu wrote:
> > src/slave/slave.cpp
> > Lines 4694 (patched)
> > <https://reviews.apache.org/r/69978/diff/5/?file=2126441#file2126441line4694>
> >
> >     Is there any reason why this is not the following?
> >     ```
> >     operation->latest_status().state()
> >     ```

Not that I can think of, changing it would be a nice readibility improvement =).


- Gastón


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213234
-----------------------------------------------------------


On Feb. 22, 2019, 5:28 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 22, 2019, 5:28 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/5/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213234
-----------------------------------------------------------



Fly by comment!


src/slave/slave.cpp
Lines 4694 (patched)
<https://reviews.apache.org/r/69978/#comment299138>

    Is there any reason why this is not the following?
    ```
    operation->latest_status().state()
    ```


- Joseph Wu


On Feb. 22, 2019, 5:28 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 22, 2019, 5:28 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/5/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/
-----------------------------------------------------------

(Updated Feb. 22, 2019, 5:28 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
-------

Removed unnecessary checkpointing (`Slave::removeOperation()` already checkpoints).


Bugs: MESOS-9574
    https://issues.apache.org/jira/browse/MESOS-9574


Repository: mesos


Description
-------

Make the agent garbage collect an operation status update stream once
a terminal status update is acknowledged.

This patch also improves the logging of acknowledgement failures and the
readability of the `Slave::operationStatusAcknowledgement` method.


Diffs (updated)
-----

  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/69978/diff/5/

Changes: https://reviews.apache.org/r/69978/diff/4-5/


Testing
-------

Manual testing + existing tests still pass.


Thanks,

Gastón Kleiman


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/
-----------------------------------------------------------

(Updated Feb. 21, 2019, 5:56 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
-------

Addressed feedback.


Bugs: MESOS-9574
    https://issues.apache.org/jira/browse/MESOS-9574


Repository: mesos


Description
-------

Make the agent garbage collect an operation status update stream once
a terminal status update is acknowledged.

This patch also improves the logging of acknowledgement failures and the
readability of the `Slave::operationStatusAcknowledgement` method.


Diffs (updated)
-----

  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/69978/diff/4/

Changes: https://reviews.apache.org/r/69978/diff/3-4/


Testing
-------

Manual testing + existing tests still pass.


Thanks,

Gastón Kleiman


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.

> On Feb. 20, 2019, 6:13 p.m., Greg Mann wrote:
> > src/slave/slave.cpp
> > Lines 7452 (patched)
> > <https://reviews.apache.org/r/69978/diff/3/?file=2126000#file2126000line7457>
> >
> >     Is it possible that the stream could still be present on disk, but the operation would not be found in the checkpoint file?
> >     
> >     Since we use the directories found on disk to recover the SUM but we use the recovered operation state when calling `addOperation()`, perhaps this call to `completedOperations.push_back(uuid);` should be moved outside of this conditional, so that we GC the stream regardless of whether or not the operation was present in the agent's checkpoint file?

This RR depends on https://reviews.apache.org/r/69977/ which makes `Slave::_recoverOperations()` handle that case.


- Gastón


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213008
-----------------------------------------------------------


On Feb. 20, 2019, 4:43 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 4:43 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/3/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213008
-----------------------------------------------------------




src/slave/slave.cpp
Lines 7452 (patched)
<https://reviews.apache.org/r/69978/#comment298861>

    Is it possible that the stream could still be present on disk, but the operation would not be found in the checkpoint file?
    
    Since we use the directories found on disk to recover the SUM but we use the recovered operation state when calling `addOperation()`, perhaps this call to `completedOperations.push_back(uuid);` should be moved outside of this conditional, so that we GC the stream regardless of whether or not the operation was present in the agent's checkpoint file?


- Greg Mann


On Feb. 21, 2019, 12:43 a.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2019, 12:43 a.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/3/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213011
-----------------------------------------------------------



PASS: Mesos patch 69978 was successfully built and tested.

Reviews applied: `['69977', '69978']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2916/mesos-review-69978

- Mesos Reviewbot Windows


On Feb. 20, 2019, 4:43 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 20, 2019, 4:43 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/3/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/#review213063
-----------------------------------------------------------


Fix it, then Ship it!





src/slave/slave.cpp
Lines 4739-4740 (patched)
<https://reviews.apache.org/r/69978/#comment298934>

    Nit: newline before `metaDir`:
    
    ```
    const string path = slave::paths::getOperationPath(
        metaDir,
        operationUuid.get());
    ```



src/slave/slave.cpp
Lines 7438 (patched)
<https://reviews.apache.org/r/69978/#comment298935>

    s/still still/still/


- Greg Mann


On Feb. 21, 2019, 12:43 a.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69978/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2019, 12:43 a.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-9574
>     https://issues.apache.org/jira/browse/MESOS-9574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Make the agent garbage collect an operation status update stream once
> a terminal status update is acknowledged.
> 
> This patch also improves the logging of acknowledgement failures and the
> readability of the `Slave::operationStatusAcknowledgement` method.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 
> 
> 
> Diff: https://reviews.apache.org/r/69978/diff/3/
> 
> 
> Testing
> -------
> 
> Manual testing + existing tests still pass.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/
-----------------------------------------------------------

(Updated Feb. 20, 2019, 4:43 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
-------

Handled a missed case.


Bugs: MESOS-9574
    https://issues.apache.org/jira/browse/MESOS-9574


Repository: mesos


Description
-------

Make the agent garbage collect an operation status update stream once
a terminal status update is acknowledged.

This patch also improves the logging of acknowledgement failures and the
readability of the `Slave::operationStatusAcknowledgement` method.


Diffs (updated)
-----

  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/69978/diff/3/

Changes: https://reviews.apache.org/r/69978/diff/2-3/


Testing
-------

Manual testing + existing tests still pass.


Thanks,

Gastón Kleiman


Re: Review Request 69978: Added garbage collection of terminated operations status update streams.

Posted by Gastón Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69978/
-----------------------------------------------------------

(Updated Feb. 13, 2019, 3:15 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
-------

Fixed typos in the commit message.


Bugs: MESOS-9574
    https://issues.apache.org/jira/browse/MESOS-9574


Repository: mesos


Description (updated)
-------

Make the agent garbage collect an operation status update stream once
a terminal status update is acknowledged.

This patch also improves the logging of acknowledgement failures and the
readability of the `Slave::operationStatusAcknowledgement` method.


Diffs (updated)
-----

  src/slave/slave.cpp e3c2c005d865b5c333e92e50e49ef398fe06ad79 


Diff: https://reviews.apache.org/r/69978/diff/2/

Changes: https://reviews.apache.org/r/69978/diff/1-2/


Testing
-------

Manual testing + existing tests still pass.


Thanks,

Gastón Kleiman