Posted to issues@mesos.apache.org by "Jay Guo (JIRA)" <ji...@apache.org> on 2016/05/25 14:58:12 UTC

[jira] [Commented] (MESOS-3302) Scheduler API v1 improvements

    [ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300176#comment-15300176 ] 

Jay Guo commented on MESOS-3302:
--------------------------------

[~vinodkone]

We are manually testing the HTTP APIs now; here are some observations:

*Cluster setup:*
* Bring up 3 masters, 3 agents, and 3 ZooKeepers
* Start the agents with the {{--use_http_command_executor}} flag so that tasks run under the HTTP command executor
* Start the long-lived framework (which uses the HTTP scheduler API)
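
For reference, the cluster is brought up roughly like this (host names, ports and work directories below are placeholders, not the exact commands we ran):

{noformat}
# The 3 ZooKeepers are assumed to already run as an ensemble on zk1, zk2, zk3.

# On each of the 3 master hosts:
mesos-master --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
             --quorum=2 \
             --work_dir=/var/lib/mesos/master

# On each of the 3 agent hosts:
mesos-agent --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
            --work_dir=/var/lib/mesos/agent \
            --use_http_command_executor

# Long-lived framework (from the Mesos examples), pointed either at a fixed
# master or at the ZooKeeper URL:
long-lived-framework --master=<master-ip>:5050
{noformat}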

*Test cases:*
* Restart leading master
_The framework is started with {{--master=<master-ip>}}, so it always talks to that fixed master, regardless of whether it is the leader._
*Expected:* The master replies with {{307 Temporary Redirect}}, the scheduler library handles the redirect and talks to the actual leading master, and all of this is transparent to the framework.
*Actual:* The redirect is reported back to the framework instead.
Is this the intended behaviour? On the other hand, when the framework is started with {{--master=zk://...}} it correctly handles master detection and resumes once a new leading master is elected, although detection attempts happen continuously without any pause. Should we consider introducing a retry interval?
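For illustration, this is roughly what the subscribe call and the redirect look like when the scheduler hits a non-leading master (IPs and the framework fields are placeholders):

{noformat}
# SUBSCRIBE sent to a non-leading master:
curl -i -X POST http://<non-leader-ip>:5050/api/v1/scheduler \
     -H 'Content-Type: application/json' \
     -d '{
           "type": "SUBSCRIBE",
           "subscribe": {
             "framework_info": {"user": "root", "name": "long-lived-framework"}
           }
         }'

# The non-leading master answers with something like:
#   HTTP/1.1 307 Temporary Redirect
#   Location: //<leader-ip>:5050
#
# The question is whether the scheduler library should follow this Location
# header and re-subscribe against the leader itself, instead of surfacing
# the 307 to the framework.
{noformat}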

* Restart agent
*Expected:* If the agent stays down longer than the timeout it is removed and its workload is migrated to other agents. If the agent comes back within the timeout, it resumes its tasks.
*Actual:* The framework keeps waiting for the agent to recover. It does resume working if the agent is back in time; otherwise it waits indefinitely.
I guess this is reasonable, since long-lived-framework declines other offers and those resources are not offered to this framework again. I don't see an option to expire the decline filter though, or am I missing something?
There is also a chance that the agent resumes the running tasks for a little while and is then _asked to terminate_ by the master. This is somewhat flaky and needs further investigation.
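As far as I can tell the v1 API itself does allow attaching a {{refuse_seconds}} filter to a decline, so it may just be a matter of what long-lived-framework sets it to. Roughly, the call would look like this (IDs are placeholders, and depending on the Mesos version the master may also require the {{Mesos-Stream-Id}} header returned on SUBSCRIBE):

{noformat}
# Hypothetical DECLINE with an explicit refuse_seconds filter; once the
# filter expires, the declined resources can be offered to this framework
# again.
curl -X POST http://<leader-ip>:5050/api/v1/scheduler \
     -H 'Content-Type: application/json' \
     -d '{
           "framework_id": {"value": "<framework-id>"},
           "type": "DECLINE",
           "decline": {
             "offer_ids": [{"value": "<offer-id>"}],
             "filters": {"refuse_seconds": 5.0}
           }
         }'
{noformat}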

* Restart long lived framework
*Expected:* Recover
*Actual:* Recover

* Restart all masters at once
Same behaviour as _restarting leading master_

* Emulate network partitions (one-way and two-way) between the long-lived framework and the master
_The network partition is emulated at the TCP layer using the iptables rule {{iptables -A INPUT -p tcp -s <framework-ip> --dport 5050 -j DROP}}._
** One-way: Master <--X-- Framework
In most cases it works as expected: the framework simply hangs, and the agent keeps resending messages since the acknowledgements are blocked. When the block is lifted, everything resumes working. However, there was one occasion where the agent kept launching new tasks during the partition without the framework being aware of it. I need to find a way to reproduce it; I suspect it depends on the state at the moment the network is cut.
** Two-way: WIP
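For the two-way case the plan is to also drop traffic in the opposite direction, roughly like this (assuming the rules are applied on the master host, as the one-way rule above suggests):

{noformat}
# One-way (used above): drop packets arriving from the framework.
iptables -A INPUT  -p tcp -s <framework-ip> --dport 5050 -j DROP

# Two-way (planned): additionally drop packets going back to the framework.
iptables -A OUTPUT -p tcp -d <framework-ip> --sport 5050 -j DROP

# Lift the partition again by deleting the rules:
iptables -D INPUT  -p tcp -s <framework-ip> --dport 5050 -j DROP
iptables -D OUTPUT -p tcp -d <framework-ip> --sport 5050 -j DROP
{noformat}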

* Restart leading Zookeeper
WIP

* Restart all Zookeepers at once
WIP

> Scheduler API v1 improvements
> -----------------------------
>
>                 Key: MESOS-3302
>                 URL: https://issues.apache.org/jira/browse/MESOS-3302
>             Project: Mesos
>          Issue Type: Epic
>            Reporter: Marco Massenzio
>              Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the {{HTTP API}} MVP epic (MESOS-2288) which was released initially with Mesos {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what we would regard as "Production-ready" state in preparation for the {{1.0.0}} release.


