You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Jie Yu <yu...@gmail.com> on 2017/12/14 01:38:16 UTC

Review Request 64589: Fixed a bug related to offer operation application in the agent.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64589/
-----------------------------------------------------------

Review request for mesos, Benjamin Bannier, Greg Mann, and Jan Schlicht.


Repository: mesos


Description
-------

The bug is introduced in this patch:
https://reviews.apache.org/r/64477/

Given that we also keep track of each resource provider's total
resources in addition to the total resources of the agent, we need to
make sure we update those totals after applying an operation.

The bug may manifest as a CHECK failure in the agent that checks if
`totalResources` of the agent is a super set of all the resource
provider resources.


Diffs
-----

  src/slave/slave.hpp de2b2e2e81ed860e6a33ce1b93d859d816ba1021 
  src/slave/slave.cpp e8f7591dc0d57ca8a0eb72f6c1c008d4005a524d 


Diff: https://reviews.apache.org/r/64589/diff/1/


Testing
-------

make check


Thanks,

Jie Yu


Re: Review Request 64589: Fixed a bug related to offer operation application in the agent.

Posted by Jie Yu <yu...@gmail.com>.

> On Dec. 14, 2017, 5:33 a.m., Chun-Hung Hsiao wrote:
> > src/slave/slave.cpp
> > Line 3847 (original), 3847 (patched)
> > <https://reviews.apache.org/r/64589/diff/1/?file=1915763#file1915763line3847>
> >
> >     I feel like there are too many helper functions in different places. Is there any uses of `stripAllocationInfo` other than applying resource conversions? If not then it seems to make more sense to me that we have one strip or unallocate function and a get RP ID method for `ResourceConversion`, then keep the original prototype for `Slave::apply`. Otherwise LGTM to this patch.

Just ResourceConversion is not sufficient, i need to get the resource provider ID, and I don't want to infer that from ResourceConversions.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64589/#review193769
-----------------------------------------------------------


On Dec. 14, 2017, 1:38 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64589/
> -----------------------------------------------------------
> 
> (Updated Dec. 14, 2017, 1:38 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Greg Mann, and Jan Schlicht.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The bug is introduced in this patch:
> https://reviews.apache.org/r/64477/
> 
> Given that we also keep track of each resource provider's total
> resources in addition to the total resources of the agent, we need to
> make sure we update those totals after applying an operation.
> 
> The bug may manifest as a CHECK failure in the agent that checks if
> `totalResources` of the agent is a super set of all the resource
> provider resources.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp de2b2e2e81ed860e6a33ce1b93d859d816ba1021 
>   src/slave/slave.cpp e8f7591dc0d57ca8a0eb72f6c1c008d4005a524d 
> 
> 
> Diff: https://reviews.apache.org/r/64589/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 64589: Fixed a bug related to offer operation application in the agent.

Posted by Chun-Hung Hsiao <ch...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64589/#review193769
-----------------------------------------------------------


Fix it, then Ship it!





src/slave/slave.cpp
Line 3847 (original), 3847 (patched)
<https://reviews.apache.org/r/64589/#comment272411>

    I feel like there are too many helper functions in different places. Is there any uses of `stripAllocationInfo` other than applying resource conversions? If not then it seems to make more sense to me that we have one strip or unallocate function and a get RP ID method for `ResourceConversion`, then keep the original prototype for `Slave::apply`. Otherwise LGTM to this patch.


- Chun-Hung Hsiao


On Dec. 14, 2017, 1:38 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64589/
> -----------------------------------------------------------
> 
> (Updated Dec. 14, 2017, 1:38 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Greg Mann, and Jan Schlicht.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The bug is introduced in this patch:
> https://reviews.apache.org/r/64477/
> 
> Given that we also keep track of each resource provider's total
> resources in addition to the total resources of the agent, we need to
> make sure we update those totals after applying an operation.
> 
> The bug may manifest as a CHECK failure in the agent that checks if
> `totalResources` of the agent is a super set of all the resource
> provider resources.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp de2b2e2e81ed860e6a33ce1b93d859d816ba1021 
>   src/slave/slave.cpp e8f7591dc0d57ca8a0eb72f6c1c008d4005a524d 
> 
> 
> Diff: https://reviews.apache.org/r/64589/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 64589: Fixed a bug related to offer operation application in the agent.

Posted by Jie Yu <yu...@gmail.com>.

> On Dec. 14, 2017, 8:35 a.m., Benjamin Bannier wrote:
> > src/slave/slave.cpp
> > Lines 7506 (patched)
> > <https://reviews.apache.org/r/64589/diff/1/?file=1915763#file1915763line7528>
> >
> >     This `CHECK` is not sufficient to make sure that `converted_resources` is set -- the operation could e.g., be terminal and failed.
> >     
> >     Could we instead assert that the operation is `OFFER_OPERTATION_FINISHED`? Alternatively we might also make sure that every terminal operation sets `converted_resources`, in case of failures e.g., to `consumed`.
> 
> Chun-Hung Hsiao wrote:
>     Agreed. That's part of the reason I feel we should keep `apply` receiving `ResourceConversion`s and not duplicating the CHECK logic all around.

Just ResourceConversion is not sufficient, i need to get the resource provider ID, and I don't want to infer that from ResourceConversions.


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64589/#review193772
-----------------------------------------------------------


On Dec. 14, 2017, 1:38 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64589/
> -----------------------------------------------------------
> 
> (Updated Dec. 14, 2017, 1:38 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Greg Mann, and Jan Schlicht.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The bug is introduced in this patch:
> https://reviews.apache.org/r/64477/
> 
> Given that we also keep track of each resource provider's total
> resources in addition to the total resources of the agent, we need to
> make sure we update those totals after applying an operation.
> 
> The bug may manifest as a CHECK failure in the agent that checks if
> `totalResources` of the agent is a super set of all the resource
> provider resources.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp de2b2e2e81ed860e6a33ce1b93d859d816ba1021 
>   src/slave/slave.cpp e8f7591dc0d57ca8a0eb72f6c1c008d4005a524d 
> 
> 
> Diff: https://reviews.apache.org/r/64589/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 64589: Fixed a bug related to offer operation application in the agent.

Posted by Chun-Hung Hsiao <ch...@mesosphere.io>.

> On Dec. 14, 2017, 8:35 a.m., Benjamin Bannier wrote:
> > src/slave/slave.cpp
> > Lines 7506 (patched)
> > <https://reviews.apache.org/r/64589/diff/1/?file=1915763#file1915763line7528>
> >
> >     This `CHECK` is not sufficient to make sure that `converted_resources` is set -- the operation could e.g., be terminal and failed.
> >     
> >     Could we instead assert that the operation is `OFFER_OPERTATION_FINISHED`? Alternatively we might also make sure that every terminal operation sets `converted_resources`, in case of failures e.g., to `consumed`.

Agreed. That's part of the reason I feel we should keep `apply` receiving `ResourceConversion`s and not duplicating the CHECK logic all around.


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64589/#review193772
-----------------------------------------------------------


On Dec. 14, 2017, 1:38 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64589/
> -----------------------------------------------------------
> 
> (Updated Dec. 14, 2017, 1:38 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Greg Mann, and Jan Schlicht.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The bug is introduced in this patch:
> https://reviews.apache.org/r/64477/
> 
> Given that we also keep track of each resource provider's total
> resources in addition to the total resources of the agent, we need to
> make sure we update those totals after applying an operation.
> 
> The bug may manifest as a CHECK failure in the agent that checks if
> `totalResources` of the agent is a super set of all the resource
> provider resources.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp de2b2e2e81ed860e6a33ce1b93d859d816ba1021 
>   src/slave/slave.cpp e8f7591dc0d57ca8a0eb72f6c1c008d4005a524d 
> 
> 
> Diff: https://reviews.apache.org/r/64589/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 64589: Fixed a bug related to offer operation application in the agent.

Posted by Benjamin Bannier <be...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64589/#review193772
-----------------------------------------------------------


Fix it, then Ship it!





src/slave/slave.cpp
Lines 7501 (patched)
<https://reviews.apache.org/r/64589/#comment272418>

    Let's `CHECK_SOME(_conversions)` explicitly here. This should be true since the operation is speculated.



src/slave/slave.cpp
Lines 7503 (patched)
<https://reviews.apache.org/r/64589/#comment272421>

    nit: non-speculative



src/slave/slave.cpp
Lines 7506 (patched)
<https://reviews.apache.org/r/64589/#comment272419>

    This `CHECK` is not sufficient to make sure that `converted_resources` is set -- the operation could e.g., be terminal and failed.
    
    Could we instead assert that the operation is `OFFER_OPERTATION_FINISHED`? Alternatively we might also make sure that every terminal operation sets `converted_resources`, in case of failures e.g., to `consumed`.



src/slave/slave.cpp
Lines 7532 (patched)
<https://reviews.apache.org/r/64589/#comment272420>

    Let's add a comment here that after updating `slave`'s `totalResources` we also need to update the resource provider's `totalResources`.


- Benjamin Bannier


On Dec. 14, 2017, 2:38 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64589/
> -----------------------------------------------------------
> 
> (Updated Dec. 14, 2017, 2:38 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Greg Mann, and Jan Schlicht.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The bug is introduced in this patch:
> https://reviews.apache.org/r/64477/
> 
> Given that we also keep track of each resource provider's total
> resources in addition to the total resources of the agent, we need to
> make sure we update those totals after applying an operation.
> 
> The bug may manifest as a CHECK failure in the agent that checks if
> `totalResources` of the agent is a super set of all the resource
> provider resources.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp de2b2e2e81ed860e6a33ce1b93d859d816ba1021 
>   src/slave/slave.cpp e8f7591dc0d57ca8a0eb72f6c1c008d4005a524d 
> 
> 
> Diff: https://reviews.apache.org/r/64589/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>


Re: Review Request 64589: Fixed a bug related to offer operation application in the agent.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64589/#review193793
-----------------------------------------------------------



FAIL: Some Mesos tests failed.

Reviews applied: `['64589']`

Failed command: `D:\DCOS\mesos\src\mesos-tests.exe --verbose`

All the build artifacts available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64589

Relevant logs:

- [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64589/logs/mesos-tests-stdout.log):

```

[----------] 1 test from IsolationFlag/CpuIsolatorTest
[ RUN      ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
[       OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (2451 ms)
[----------] 1 test from IsolationFlag/CpuIsolatorTest (2475 ms total)

[----------] 1 test from IsolationFlag/MemoryIsolatorTest
[ RUN      ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (2474 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (2498 ms total)

[----------] Global test environment tear-down
[==========] 835 tests from 85 test cases ran. (315553 ms total)
[  PASSED  ] 825 tests.
[  FAILED  ] 10 tests, listed below:
[  FAILED  ] OfferOperationStatusUpdateManagerTest.UpdateAndAckNonTerminalUpdate
[  FAILED  ] OfferOperationStatusUpdateManagerTest.RecoverCheckpointedStream
[  FAILED  ] OfferOperationStatusUpdateManagerTest.RecoverEmptyFile
[  FAILED  ] OfferOperationStatusUpdateManagerTest.RecoverTerminatedStream
[  FAILED  ] OfferOperationStatusUpdateManagerTest.IgnoreDuplicateUpdate
[  FAILED  ] OfferOperationStatusUpdateManagerTest.IgnoreDuplicateUpdateAfterRecover
[  FAILED  ] OfferOperationStatusUpdateManagerTest.RejectDuplicateAck
[  FAILED  ] OfferOperationStatusUpdateManagerTest.RejectDuplicateAckAfterRecover
[  FAILED  ] OfferOperationStatusUpdateManagerTest.NonStrictRecoveryCorruptedFile
[  FAILED  ] OfferOperationStatusUpdateManagerTest.UpdateLatestWhenResending

10 FAILED TESTS
  YOU HAVE 205 DISABLED TESTS

```

- [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/64589/logs/mesos-tests-stderr.log):

```
I1214 13:05:24.880967   788 executor.cpp:171] Received SUBSCRIBED event
I1214 13:05:24.884940   788 executor.cpp:175] Subscribed executor on build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net
I1214 13:05:24.885941   788 executor.cpp:171] Received LAUNCH event
I1214 13:05:24.888962   788 executor.cpp:637] Starting task 085d4ecf-ec82-4744-9347-46cd8c148780
I1214 13:05:24.968957   788 executor.cpp:477] Running 'D:\DCOS\mesos\src\mesos-containerizer.exe launch <POSSIBLY-SENSITIVE-DATA>'
I1214 13:05:25.559913   788 executor.cpp:650] Forked command at 6480
I1214 13:05:25.586910  5040 exec.cpp:435] Executor asked to shutdown
I1214 13:05:25.586910   788 executor.cpp:171] Received SHUTDOWN event
I1214 13:05:25.586910   788 executor.cpp:747] Shutting down
I1214 13:05:25.586910   788 executor.cpp:854] Sending SIGTERM to process tree at pid 6 master.cpp:3327] Deactivating framework a889b28d-e2ae-491a-8972-2329ce847d62-0000 (default) at scheduler-191bfd4d-7c22-4d9a-ac6a-4c0246f87921@10.3.1.11:55943
I1214 13:05:25.584908  6116 hierarchical.cpp:405] Deactivated framework a889b28d-e2ae-491a-8972-2329ce847d62-0000
I1214 13:05:25.584908  6396 master.cpp:10154] Updating the state of task 085d4ecf-ec82-4744-9347-46cd8c148780 of framework a889b28d-e2ae-491a-8972-2329ce847d62-0000 (latest state: TASK_KILLED, status update state: TASK_KILLED)
I1214 13:05:25.584908  6140 slave.cpp:3400] Shutting down framework a889b28d-e2ae-491a-8972-2329ce847d62-0000
I1214 13:05:25.584908  6140 slave.cpp:6106] Shutting down executor '085d4ecf-ec82-4744-9347-46cd8c148780' of framework a889b28d-e2ae-491a-8972-2329ce847d62-0000 at executor(1)@10.3.1.11:55964
I1214 13:05:25.585933  6140 slave.cpp:909] Agent terminating
W1214 13:05:25.585933  6140 slave.cpp:3396] Ignoring shutdown framework a889b28d-e2ae-491a-8972-2329ce847d62-0000 because it is terminating
I1214 13:05:25.586910  6396 master.cpp:10260] Removing task 085d4ecf-ec82-4744-9347-46cd8c148780 with resources cpus(allocated: *):4; mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: *):[31000-32000] of framework a889b28d-e2ae-491a-8972-2329ce847d62-0000 on agent a889b28d-e2ae-491a-8972-2329ce847d62-S0 at slave(327)@10.3.1.11:55943 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I1214 13:05:25.589907  5784 containerizer.cpp:2337] Destroying container a3fddfbc-1d24-46ab-8fb9-a3f603a21180 in RUNNING state
I1214 13:05:25.589907  5784 containerizer.cpp:2939] Transitioning the state of container a3fddfbc-1d24-46ab-8fb9-a3f603a21180 from RUNNING to DESTROYING
I1214 13:05:25.589907  6396 master.cpp:1305] Agent a889b28d-e2ae-491a-8972-2329ce847d62-S0 at slave(327)@10.3.1.11:55943 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net) disconnected
I1214 13:05:25.589907  6396 master.cpp:3364] Disconnecting agent a889b28d-e2ae-491a-8972-2329ce847d62-S0 at slave(327)@10.3.1.11:55943 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I1214 13:05:25.589907  6116 hierarchical.cpp:344] Removed framework a889b28d-e2ae-491a-8972-2329ce847d62-0000
I1214 13:05:25.590909  5784 launcher.cpp:156] Asked to destroy container a3fddfbc-1d24-46ab-8fb9-a3f603a21180
I1214 13:05:25.590909  6396 master.cpp:3383] Deactivating agent a889b28d-e2ae-491a-8972-2329ce847d62-S0 at slave(327)@10.3.1.11:55943 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I1214 13:05:25.590909  6112 hierarchical.cpp:766] Agent a889b28d-e2ae-491a-8972-2329ce847d62-S0 deactivated
I1214 13:05:25.614920  6396 containerizer.cpp:2788] Container a3fddfbc-1d24-46ab-8fb9-a3f603a21180 has exited
I1214 13:05:25.644914  4140 master.cpp:1147] Master terminating
I1214 13:05:25.646930   792 hierarchical.cpp:609] Removed agent a889b28d-e2ae-491a-8972-2329ce847d62-S0
I1214 13:05:25.889905  2340 process.cpp:887] Failed to accept socket: future discarded
```

- Mesos Reviewbot Windows


On Dec. 14, 2017, 1:38 a.m., Jie Yu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64589/
> -----------------------------------------------------------
> 
> (Updated Dec. 14, 2017, 1:38 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Greg Mann, and Jan Schlicht.
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The bug is introduced in this patch:
> https://reviews.apache.org/r/64477/
> 
> Given that we also keep track of each resource provider's total
> resources in addition to the total resources of the agent, we need to
> make sure we update those totals after applying an operation.
> 
> The bug may manifest as a CHECK failure in the agent that checks if
> `totalResources` of the agent is a super set of all the resource
> provider resources.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp de2b2e2e81ed860e6a33ce1b93d859d816ba1021 
>   src/slave/slave.cpp e8f7591dc0d57ca8a0eb72f6c1c008d4005a524d 
> 
> 
> Diff: https://reviews.apache.org/r/64589/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>