Posted to issues@mesos.apache.org by "Benjamin Mahler (Jira)" <ji...@apache.org> on 2019/10/18 19:08:00 UTC

[jira] [Commented] (MESOS-10005) Only broadcast framework update to agents associated with framework

    [ https://issues.apache.org/jira/browse/MESOS-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954913#comment-16954913 ] 

Benjamin Mahler commented on MESOS-10005:
-----------------------------------------

Attached a patch that can be applied on 1.9.x to see if parallel broadcasting helps alleviate this.
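
For reference, "parallel broadcasting" here just means fanning the per-agent sends out over several workers instead of one serial loop. Below is a minimal standalone sketch of one way to do that; it is not the attached patch, and AgentEndpoint / sendUpdateToAgent() are made-up placeholders:

{noformat}
// Standalone sketch only: shard the per-agent sends across a few async tasks
// instead of one serial loop. Not the attached patch; AgentEndpoint and
// sendUpdateToAgent() are hypothetical placeholders.
#include <algorithm>
#include <future>
#include <string>
#include <vector>

struct AgentEndpoint {
  std::string ip;
  int port;
};

// Stub: real code would serialize UpdateFrameworkMessage and write it to the
// agent's socket.
void sendUpdateToAgent(const AgentEndpoint& agent, const std::string& update) {}

void broadcastInParallel(const std::vector<AgentEndpoint>& agents,
                         const std::string& update,
                         size_t shards = 8) {
  if (agents.empty()) {
    return;
  }

  const size_t chunk = (agents.size() + shards - 1) / shards;
  std::vector<std::future<void>> pending;

  for (size_t begin = 0; begin < agents.size(); begin += chunk) {
    const size_t end = std::min(begin + chunk, agents.size());
    pending.push_back(std::async(std::launch::async,
                                 [&agents, &update, begin, end] {
      for (size_t i = begin; i < end; ++i) {
        // A slow or dead agent now only stalls this shard, not every send.
        sendUpdateToAgent(agents[i], update);
      }
    }));
  }

  for (std::future<void>& shard : pending) {
    shard.wait();
  }
}
{noformat}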

> Only broadcast framework update to agents associated with framework
> -------------------------------------------------------------------
>
>                 Key: MESOS-10005
>                 URL: https://issues.apache.org/jira/browse/MESOS-10005
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.9.0
>         Environment: Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos 1.4.1 on the agents. Spark 2.1.1 is the primary framework running.
>            Reporter: Terra Field
>            Priority: Major
>         Attachments: 0001-Send-framework-updates-in-parallel.patch, mesos-master.log.gz, mesos-master.stacks - 2 - 1.9.0.gz, mesos-master.stacks - 3 - 1.9.0.gz, mesos-master.stacks - 4 - framework update - 1.9.0.gz, mesos-master.stacks - 5 - new healthy master.gz
>
>
> We have at any given time ~100 frameworks connected to our Mesos Master, with agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been encountering a crash (which I'll document separately), and when that happens the new Mesos Master will sometimes (but not always) struggle to catch up and eventually crash again. Usually the third or fourth crash ends with a stable master (not ideal, but at least we get back to a stable state).
> Looking over the logs, I'm seeing hundreds of attempts to contact dead agents each second (and presumably many contacts with healthy agents that don't throw an error):
> {noformat}
> W1003 21:39:39.299998  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: Failed to connect to 100.82.103.99:5051: Connection refused
> W1003 21:39:39.300143  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: Failed to connect to 100.85.122.190:5051: Connection refused
> W1003 21:39:39.300285  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: Failed to connect to 100.85.84.187:5051: Connection refused
> W1003 21:39:39.302122  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: Failed to connect to 100.82.163.228:5051: Connection refused
> {noformat}
> I gave [~bmahler] a perf trace of the master on Slack at this point, and it looks like the master is spending a significant amount of time doing framework update broadcasting. I'll attach the perf dump to the ticket, as well as the log of what the master did while it was alive.
> It sounds like, currently, every framework update (we have ~100 frameworks in total) results in a broadcast to all 6,000-11,000 agents (depending on how busy the cluster is). Also, since our health checks currently rely on the UI, we usually end up killing the master because it fails health checks for long stretches while it is overwhelmed by these broadcasts.
> Could optimizations be made to either throttle these broadcasts or only target the nodes that need those framework updates? A rough sketch of the second option follows below.
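> 
> For illustration, here is a minimal sketch of what "only target the nodes that need the update" could look like; Agent and sendUpdateToAgent() are hypothetical names, not actual Mesos source:
> {noformat}
> // Illustrative sketch only: deliver the framework update just to agents that
> // actually run tasks or executors for that framework. Agent and
> // sendUpdateToAgent() are made-up names, not Mesos source.
> #include <set>
> #include <string>
> #include <vector>
>
> struct Agent {
>   std::string pid;                     // e.g. "slave(1)@100.82.103.99:5051"
>   std::set<std::string> frameworkIds;  // frameworks with tasks/executors here
> };
>
> // Stub: real code would serialize UpdateFrameworkMessage and send it.
> void sendUpdateToAgent(const Agent& agent, const std::string& update) {}
>
> void broadcastFrameworkUpdate(const std::vector<Agent>& agents,
>                               const std::string& frameworkId,
>                               const std::string& update) {
>   for (const Agent& agent : agents) {
>     if (agent.frameworkIds.count(frameworkId) > 0) {
>       sendUpdateToAgent(agent, update);  // unrelated agents are skipped entirely
>     }
>   }
> }
> {noformat}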



--
This message was sent by Atlassian Jira
(v8.3.4#803005)