Posted to dev@myriad.apache.org by "Sarjeet Singh (JIRA)" <ji...@apache.org> on 2015/09/09 02:09:46 UTC

[jira] [Created] (MYRIAD-133) Multiple flexed up NMs try to run on same node, altogether.

Sarjeet Singh created MYRIAD-133:
------------------------------------

             Summary: Multiple flexed up NMs try to run on same node, altogether.
                 Key: MYRIAD-133
                 URL: https://issues.apache.org/jira/browse/MYRIAD-133
             Project: Myriad
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: Myriad 0.1.0
            Reporter: Sarjeet Singh


On a 3-node cluster running the latest build with the NM + executor merge, I am seeing an issue when flexing up
multiple NM instances: the flexed-up NMs all try to start on the same node at the same time.

Here are the existing/already running tasks from Myriad (before the multiple-NM
flex up):

[root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state
{"pendingTasks":[],
"stagingTasks":[],
"activeTasks":[
"jobhistory.jobhistory.a25e35c8-c551-498a-81ff-5b29389064c7",
"nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89",
"nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"],
"killableTasks":[]}

Then I tried flexing up 4 instances of the zero-profile NM. Note that only 1
node is without any NM; the other 2 nodes are already running NMs (see above).

Here is the task status from Myriad just after the flex up, and again once all
NMs reached active state:

[root@qa101-137 ~]# curl -H "Content-Type: application/json" -X PUT \
    -d '{"instances":4, "profile":"zero"}' \
    http://testrm.marathon.mesos:8192/api/cluster/flexup

[root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state | python -mjson.tool
{
    "activeTasks": [
        "jobhistory.jobhistory.a25e35c8-c551-498a-81ff-5b29389064c7", 
        "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89", 
        "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"
    ], 
    "killableTasks": [], 
    "pendingTasks": [
        "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee", 
        "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9", 
        "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
    ], 
    "stagingTasks": [
        "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0"
    ]
}

[root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state | python -mjson.tool
{
    "activeTasks": [
        "jobhistory.jobhistory.a25e35c8-c551-498a-81ff-5b29389064c7", 
        "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0", 
        "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89", 
        "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52", 
        "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee", 
        "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9", 
        "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
    ], 
    "killableTasks": [], 
    "pendingTasks": [], 
    "stagingTasks": []
}

On Mesos, all 4 NMs tried to start on a single node; they were all in RUNNING
state at some point, and then moved to LOST state once things settled down.
Myriad also moved the unsuccessful tasks from active back to pending state
later on, as the state output further below shows.
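
To confirm on the Mesos side that all four zero-profile attempts were placed on
the same slave, here is a minimal sketch; it assumes the Mesos master's state
endpoint is reachable at master.mesos:5050 (hostname/port are placeholders for
this cluster) and relies on the standard state.json task fields:

[root@qa101-137 ~]# curl -s http://master.mesos:5050/master/state.json | python -c '
import json, sys
state = json.load(sys.stdin)
for fw in state.get("frameworks", []):
    # Include completed_tasks so the LOST NM attempts show up as well.
    for task in fw.get("tasks", []) + fw.get("completed_tasks", []):
        if task["id"].startswith("nm."):
            print("%s  %s  %s" % (task["id"], task["slave_id"], task["state"]))
'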

[root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state | python -mjson.tool
{
    "activeTasks": [
        "jobhistory.jobhistory.a25e35c8-c551-498a-81ff-5b29389064c7", 
        "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0", 
        "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89", 
        "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"
    ], 
    "killableTasks": [], 
    "pendingTasks": [
        "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee", 
        "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9", 
        "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
    ], 
    "stagingTasks": []
}
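
As a temporary workaround I can flex the stuck zero-profile instances back
down. A minimal sketch, assuming the flexdown endpoint in this build accepts a
payload of the same shape as flexup (the exact body it expects may differ):

[root@qa101-137 ~]# curl -H "Content-Type: application/json" -X PUT \
    -d '{"instances":3}' \
    http://testrm.marathon.mesos:8192/api/cluster/flexdown

That only clears the pending tasks, though; nothing prevents the next flex up
from packing all the new instances onto the same node again.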
Let me know if you need any additional details regarding the issue.


