You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Ilya Pronin (JIRA)" <ji...@apache.org> on 2017/04/11 15:49:42 UTC

[jira] [Comment Edited] (MESOS-7376) Long registry updates when the number of agents is high

    [ https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964547#comment-15964547 ] 

Ilya Pronin edited comment on MESOS-7376 at 4/11/17 3:49 PM:
-------------------------------------------------------------

Review request: https://reviews.apache.org/r/58355/

{{Registry}} is being copied as a part of {{state::protobuf::State::Variable<Registry>}} or as a bound parameter in {{state::protobuf::State::store()}}. The former can be mitigated by adding move support to {{Variable}} (using {{Swap()}} for protobuf message). The latter - by using {{Owned}}. But that's not enough because {{Variable}} will still be copied in return value propagation through the {{Future}}-s chain. So in my patch I bypassed {{state::protobuf::State}}.

[~bmahler] can you shepherd this please?

h4. Benchmark results
Before:
{noformat}
I0411 10:04:11.726016 11802 registrar.cpp:508] Successfully updated the registry in 89.478144ms
I0411 10:04:13.860827 11803 registrar.cpp:508] Successfully updated the registry in 216.688896ms
I0411 10:04:15.167768 11803 registrar.cpp:508] Successfully updated the registry in 1.29364096secs
I0411 10:04:18.967394 11803 registrar.cpp:508] Successfully updated the registry in 3.696552192secs
I0411 10:04:25.631009 11803 registrar.cpp:508] Successfully updated the registry in 6.267425024secs
I0411 10:04:42.625507 11803 registrar.cpp:508] Successfully updated the registry in 15.876419072secs
I0411 10:04:44.209377 11787 registrar_tests.cpp:1262] Admitted 50000 agents in 30.479743816secs
I0411 10:05:04.446650 11820 registrar.cpp:508] Successfully updated the registry in 18.338545152secs
I0411 10:05:21.171001 11820 registrar.cpp:508] Successfully updated the registry in 15.31903872secs
I0411 10:05:37.592319 11820 registrar.cpp:508] Successfully updated the registry in 14.863101952secs
I0411 10:05:39.099174 11787 registrar_tests.cpp:1276] Marked 50000 agents reachable in 53.593596102secs
../../src/tests/registrar_tests.cpp:1287: Failure
Failed to wait 15secs for registry
{noformat}
After:
{noformat}
I0411 15:19:12.228904 40643 registrar.cpp:524] Successfully updated the registry in 91.262208ms
I0411 15:19:14.543190 40660 registrar.cpp:524] Successfully updated the registry in 377.45408ms
I0411 15:19:15.707006 40660 registrar.cpp:524] Successfully updated the registry in 1.138724096secs
I0411 15:19:18.267305 40660 registrar.cpp:524] Successfully updated the registry in 2.466145792secs
I0411 15:19:19.092073 40660 registrar.cpp:524] Successfully updated the registry in 523.11296ms
I0411 15:19:20.809330 40648 registrar.cpp:524] Successfully updated the registry in 892.141824ms
I0411 15:19:21.194135 40622 registrar_tests.cpp:1262] Admitted 50000 agents in 6.938952085secs
I0411 15:19:26.973904 40637 registrar.cpp:524] Successfully updated the registry in 3.938064128secs
I0411 15:19:28.631865 40637 registrar.cpp:524] Successfully updated the registry in 1.116326144secs
I0411 15:19:30.222944 40660 registrar.cpp:524] Successfully updated the registry in 911.86688ms
I0411 15:19:30.678509 40622 registrar_tests.cpp:1276] Marked 50000 agents reachable in 8.249523305secs
I0411 15:19:35.138797 40645 registrar.cpp:524] Successfully updated the registry in 815.439104ms
I0411 15:19:41.783651 40622 registrar_tests.cpp:1288] Recovered 50000 agents (8238915B) in 10.963297677secs
I0411 15:19:47.431670 40657 registrar.cpp:524] Successfully updated the registry in 3.960920064secs
I0411 15:20:13.769872 40657 registrar.cpp:524] Successfully updated the registry in 1.169234944secs
I0411 15:21:49.685801 40657 registrar.cpp:524] Successfully updated the registry in 264.850688ms
Removed 50000 agents in 2.12256788111667mins
{noformat}

Similar picture in scale testing:
{noformat}
I0411 13:04:27.598438 41549 registrar.cpp:537] Successfully updated the registry in 2.68846208secs
I0411 13:04:30.716615 41552 registrar.cpp:537] Successfully updated the registry in 2.61457792secs
I0411 13:04:33.828133 41554 registrar.cpp:537] Successfully updated the registry in 2.644827904secs
I0411 13:04:37.634577 41553 registrar.cpp:537] Successfully updated the registry in 3.338414848secs
I0411 13:04:40.723475 41546 registrar.cpp:537] Successfully updated the registry in 2.629734144secs
{noformat}


was (Author: ipronin):
Review request: https://reviews.apache.org/r/58355/

{{Registry}} is being copied as a part of {{state::protobuf::State::Variable<Registry>}} or as a bound parameter in {{state::protobuf::State::store()}}. The former can be mitigated by adding move support to {{Variable}} (using {{Swap()}} for protobuf message). The latter - by using {{Owned}}. But that's not enough because {{Variable}} will still be copied in return value propagation through the {{Future}}-s chain. So in my patch I bypassed {{state::protobuf::State}}.

[~bmahler] can you shepherd this please?

h4. Benchmark results
Before:
{noformat}
I0411 10:04:11.726016 11802 registrar.cpp:508] Successfully updated the registry in 89.478144ms
I0411 10:04:13.860827 11803 registrar.cpp:508] Successfully updated the registry in 216.688896ms
I0411 10:04:15.167768 11803 registrar.cpp:508] Successfully updated the registry in 1.29364096secs
I0411 10:04:18.967394 11803 registrar.cpp:508] Successfully updated the registry in 3.696552192secs
I0411 10:04:25.631009 11803 registrar.cpp:508] Successfully updated the registry in 6.267425024secs
I0411 10:04:42.625507 11803 registrar.cpp:508] Successfully updated the registry in 15.876419072secs
I0411 10:04:44.209377 11787 registrar_tests.cpp:1262] Admitted 50000 agents in 30.479743816secs
I0411 10:05:04.446650 11820 registrar.cpp:508] Successfully updated the registry in 18.338545152secs
I0411 10:05:21.171001 11820 registrar.cpp:508] Successfully updated the registry in 15.31903872secs
I0411 10:05:37.592319 11820 registrar.cpp:508] Successfully updated the registry in 14.863101952secs
I0411 10:05:39.099174 11787 registrar_tests.cpp:1276] Marked 50000 agents reachable in 53.593596102secs
../../src/tests/registrar_tests.cpp:1287: Failure
Failed to wait 15secs for registry
{noformat}
After:
{noformat}
I0411 15:19:12.228904 40643 registrar.cpp:524] Successfully updated the registry in 91.262208ms
I0411 15:19:14.543190 40660 registrar.cpp:524] Successfully updated the registry in 377.45408ms
I0411 15:19:15.707006 40660 registrar.cpp:524] Successfully updated the registry in 1.138724096secs
I0411 15:19:18.267305 40660 registrar.cpp:524] Successfully updated the registry in 2.466145792secs
I0411 15:19:19.092073 40660 registrar.cpp:524] Successfully updated the registry in 523.11296ms
I0411 15:19:20.809330 40648 registrar.cpp:524] Successfully updated the registry in 892.141824ms
I0411 15:19:21.194135 40622 registrar_tests.cpp:1262] Admitted 50000 agents in 6.938952085secs
I0411 15:19:26.973904 40637 registrar.cpp:524] Successfully updated the registry in 3.938064128secs
I0411 15:19:28.631865 40637 registrar.cpp:524] Successfully updated the registry in 1.116326144secs
I0411 15:19:30.222944 40660 registrar.cpp:524] Successfully updated the registry in 911.86688ms
I0411 15:19:30.678509 40622 registrar_tests.cpp:1276] Marked 50000 agents reachable in 8.249523305secs
I0411 15:19:35.138797 40645 registrar.cpp:524] Successfully updated the registry in 815.439104ms
I0411 15:19:41.783651 40622 registrar_tests.cpp:1288] Recovered 50000 agents (8238915B) in 10.963297677secs
I0411 15:19:47.431670 40657 registrar.cpp:524] Successfully updated the registry in 3.960920064secs
I0411 15:20:13.769872 40657 registrar.cpp:524] Successfully updated the registry in 1.169234944secs
I0411 15:21:49.685801 40657 registrar.cpp:524] Successfully updated the registry in 264.850688ms
Removed 50000 agents in 2.12256788111667mins
{noformat}

> Long registry updates when the number of agents is high
> -------------------------------------------------------
>
>                 Key: MESOS-7376
>                 URL: https://issues.apache.org/jira/browse/MESOS-7376
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.3.0
>            Reporter: Ilya Pronin
>            Assignee: Ilya Pronin
>
> During scale testing we discovered that as the number of registered agents grows the time it takes to update the registry grows to unacceptable values very fast. At some point it starts exceeding {{registry_store_timeout}} which doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big object graph that takes roughly 0.4 sec (with 55k agents).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)