Posted to dev@myriad.apache.org by "Santosh Marella (JIRA)" <ji...@apache.org> on 2015/10/14 10:17:05 UTC

[jira] [Commented] (MYRIAD-133) Multiple flexed up NMs try to run on same node, altogether.

    [ https://issues.apache.org/jira/browse/MYRIAD-133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956477#comment-14956477 ] 

Santosh Marella commented on MYRIAD-133:
----------------------------------------

Hostname uniqueness is checked only against the active NM tasks. The problem is that NM tasks in the staging state are not included in that check.
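
To illustrate the point above, here is a minimal Python sketch of a hostname-uniqueness check that looks at staging tasks as well as active ones. It is not Myriad's scheduler code (which is Java). The function name `host_is_free`, the task-to-hostname mappings, and the hostnames `node1` to `node3` are all hypothetical, introduced only for illustration.

```python
from typing import Dict


def host_is_free(hostname: str,
                 active: Dict[str, str],
                 staging: Dict[str, str]) -> bool:
    """Return True if no active or staging NM task is bound to hostname.

    `active` and `staging` map task IDs to the hostname each task was
    launched on. Both structures are hypothetical, for illustration only.
    """
    used_hosts = set(active.values()) | set(staging.values())
    return hostname not in used_hosts


# Task IDs are taken from the issue description; hostnames are made up.
active = {
    "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89": "node1",
    "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52": "node2",
}
staging = {"nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0": "node3"}

# An active-only check treats node3 as free even though an NM is staging there.
print(host_is_free("node3", active, {}))       # True
# Including staging tasks rejects a second NM on node3.
print(host_is_free("node3", active, staging))  # False
```

With an active-only check, every offer from the one empty node looks acceptable until the first NM there becomes active, which would match the behaviour reported in the issue description.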

> Multiple flexed up NMs try to run on same node, altogether.
> -----------------------------------------------------------
>
>                 Key: MYRIAD-133
>                 URL: https://issues.apache.org/jira/browse/MYRIAD-133
>             Project: Myriad
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Sarjeet Singh
>            Assignee: Swapnil Daingade
>             Fix For: Myriad 0.1.0
>
>
> On a 3-node cluster running the latest build with the NM + Executor merge, I am
> seeing an issue when flexing up multiple NM instances: several NMs try to start
> on the same node at the same time.
> Here are the tasks already running in Myriad (before the multiple-NM
> flex up):
> [root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state
> {"pendingTasks":[],
> "stagingTasks":[],
> "activeTasks":[
> "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89",
> "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"],
> "killableTasks":[]}
> Then I tried flexing up 4 instances of the zero-profile NM. Note that only 1
> node has no NM; the other 2 nodes are already running NMs (see above).
> Here is the task status from Myriad just after the flex up, and again once all
> NMs were in the active state.
> [root@qa101-137 ~]# curl -H "Content-Type: application/json" -X PUT -d \
> '{"instances":4, "profile":"zero"}' \
> http://testrm.marathon.mesos:8192/api/cluster/flexup
> [root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state |
> python -mjson.tool
> {
>     "activeTasks": [
>         "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89",
>         "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"
>     ], 
>     "killableTasks": [], 
>     "pendingTasks": [
>         "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee", 
>         "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9", 
>         "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
>     ], 
>     "stagingTasks": [
>         "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0"
>     ]
> }
> [root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state |
> python -mjson.tool
> {
>     "activeTasks": [
>         "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0", 
>         "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89", 
>         "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52", 
>         "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee", 
>         "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9", 
>         "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
>     ], 
>     "killableTasks": [], 
>     "pendingTasks": [], 
>     "stagingTasks": []
> }
> On Mesos, all 4 NMs try to start on a single node. They are all in RUNNING
> state at some point, and then move to LOST state after the NMs settle down.
> Myriad later moved the remaining unsuccessful tasks from active back to
> pending state.
> [root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state |
> python -mjson.tool
> {
>     "activeTasks": [
>         "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0", 
>         "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89", 
>         "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"
>     ], 
>     "killableTasks": [], 
>     "pendingTasks": [
>         "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee", 
>         "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9", 
>         "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
>     ], 
>     "stagingTasks": []
> }
> Let me know if you need any additional details regarding the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)