You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by m9a <gi...@git.apache.org> on 2018/03/30 17:10:50 UTC

[GitHub] mesos pull request #279: WIP: Remove unknown unreachable tasks when agent re...

GitHub user m9a opened a pull request:

    https://github.com/apache/mesos/pull/279

    WIP: Remove unknown unreachable tasks when agent re-registers.

    A RunTask messsage could get dropped for an agent while it's
    disconnected from the master and when such an agent goes unreachable
    then this dropped task message gets added to the unreachable tasks.
    When the agent re-registers, the master sends status updates for the
    tasks that the agent reported when re-registering and these tasks are
    also removed from the unreachableTasks on the framework but since the
    agent doesn't know about the dropped task so it doesn't get removed
    from the unreachableTasks leading to a check failure when
    this inconsistency is detected during framework removal.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/m9a/mesos bug-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mesos/pull/279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #279
    
----
commit 76c0c34bb4f676b960532d6d0dd86308cd59b0ad
Author: Megha Sharma <ms...@...>
Date:   2018-03-16T00:27:06Z

    Remove unknown unreachable tasks when agent re-registers.
    
    A RunTask messsage could get dropped for an agent while it's
    disconnected from the master and when such an agent goes unreachable
    then this dropped task message gets added to the unreachable tasks.
    When the agent re-registers, the master sends status updates for the
    tasks that the agent reported when re-registering and these tasks are
    also removed from the unreachableTasks on the framework but since the
    agent doesn't know about the dropped task so it doesn't get removed
    from the unreachableTasks leading to a check failure when
    this inconsistency is detected during framework removal.

----


---

[GitHub] mesos pull request #279: WIP: Remove unknown unreachable tasks when agent re...

Posted by xujyan <gi...@git.apache.org>.
Github user xujyan commented on a diff in the pull request:

    https://github.com/apache/mesos/pull/279#discussion_r179546249
  
    --- Diff: src/master/master.cpp ---
    @@ -6805,6 +6807,7 @@ void Master::__reregisterSlave(
           Framework* framework = getFramework(frameworkId);
           if (framework != nullptr) {
             framework->unreachableTasks.erase(task.task_id());
    +        frameworkToTasks[frameworkId].erase(task.task_id());
    --- End diff --
    
    I see why you are doing this but I feel it is making things a bit more complex than necessary.
    
    Overall we are just removing tasks from `framework.unreachableTasks` from two sources, can we just separate the two? You already have the code below that does that. See comment below.


---

[GitHub] mesos pull request #279: WIP: Remove unknown unreachable tasks when agent re...

Posted by xujyan <gi...@git.apache.org>.
Github user xujyan commented on a diff in the pull request:

    https://github.com/apache/mesos/pull/279#discussion_r179542793
  
    --- Diff: src/master/master.cpp ---
    @@ -6792,6 +6793,7 @@ void Master::__reregisterSlave(
       // master (those tasks were previously marked "unreachable", so they
       // should be removed from that collection).
       vector<Task> recoveredTasks;
    +  hashmap<FrameworkID, hashset<TaskID>> frameworkToTasks = slaves.unreachableAgentsToTasks[slaveInfo.id()];
    --- End diff --
    
    If possible we should avoid copying things out? Are we able to access the tasks in-place?
    
    Also, use `at` instead of `[]` to access the map when reading it (after testing the existence of keys by `contains`) to avoid accidentally inserting empty entries.
    
    It's possible for an agent to not show up in `slaves.unreachableTasks` because of its GC algorithm right?


---

[GitHub] mesos pull request #279: WIP: Remove unknown unreachable tasks when agent re...

Posted by xujyan <gi...@git.apache.org>.
Github user xujyan commented on a diff in the pull request:

    https://github.com/apache/mesos/pull/279#discussion_r179547033
  
    --- Diff: src/master/master.cpp ---
    @@ -6843,6 +6846,17 @@ void Master::__reregisterSlave(
         recoveredTasks.push_back(std::move(task));
       }
     
    +  foreachkey(FrameworkID frameworkId, frameworkToTasks) {
    --- End diff --
    
    Can we just reference entries from `slaves.unreachableTasks` directly? Also if some items in `framework-> unreachableTasks` are already removed during recovering tasks above, they'll just be noops here.


---

[GitHub] mesos issue #279: WIP: Remove unknown unreachable tasks when agent re-regist...

Posted by m9a <gi...@git.apache.org>.
Github user m9a commented on the issue:

    https://github.com/apache/mesos/pull/279
  
    The JIRA for this PR: https://issues.apache.org/jira/browse/MESOS-8750
    Since @xujyan is shepherding it I intended to set him as the reviewer but it doesn't look like I can change those fields on the PR.


---

[GitHub] mesos pull request #279: WIP: Remove unknown unreachable tasks when agent re...

Posted by xujyan <gi...@git.apache.org>.
Github user xujyan commented on a diff in the pull request:

    https://github.com/apache/mesos/pull/279#discussion_r179548772
  
    --- Diff: src/master/master.cpp ---
    @@ -10046,6 +10060,21 @@ void Master::__removeSlave(
           unreachable = false;
           newTaskState = TASK_GONE_BY_OPERATOR;
           newTaskReason = TaskStatus::REASON_SLAVE_REMOVED_BY_OPERATOR;
    +    } else {
    --- End diff --
    
    Can this be done in `Master::removeTask()`?


---

[GitHub] mesos issue #279: WIP: Remove unknown unreachable tasks when agent re-regist...

Posted by m9a <gi...@git.apache.org>.
Github user m9a commented on the issue:

    https://github.com/apache/mesos/pull/279
  
    Closing this in favor of https://reviews.apache.org/r/66644/diff/1#index_header


---

[GitHub] mesos pull request #279: WIP: Remove unknown unreachable tasks when agent re...

Posted by m9a <gi...@git.apache.org>.
Github user m9a closed the pull request at:

    https://github.com/apache/mesos/pull/279


---

[GitHub] mesos pull request #279: WIP: Remove unknown unreachable tasks when agent re...

Posted by xujyan <gi...@git.apache.org>.
Github user xujyan commented on a diff in the pull request:

    https://github.com/apache/mesos/pull/279#discussion_r179539166
  
    --- Diff: src/master/master.hpp ---
    @@ -1913,6 +1913,10 @@ class Master : public ProtobufProcess<Master>
         // `registry_max_agent_age`, and `registry_max_agent_count` flags.
         LinkedHashMap<SlaveID, TimeInfo> unreachable;
     
    +
    +    // Holds the `FrameworkID` and set of `TaskID` for the agents marked unreachable.
    +    hashmap<SlaveID, hashmap<FrameworkID, hashset<TaskID>>> unreachableAgentsToTasks;
    --- End diff --
    
    I would suggest s/unreachableAgentsToTasks/unreachableTasks/. It's under the `slaves` struct already so the longer name would be a bit redundant. However we probably need a long comment explaining why we need to do this, in particular:
    
    - This helps us look up all unreachable tasks on an agent so we can remove them from their primary storage`framework.unreacahbleTasks` when an agent reregisters.
    - This map is bounded by the same GC behavior as `unreachable`. i.e., the agent is GC'd from `unreachable` it's also erased from `unreachableTasks`.
    
    Can we use an multihashmap (in stout)?


---

[GitHub] mesos issue #279: WIP: Remove unknown unreachable tasks when agent re-regist...

Posted by Gilbert88 <gi...@git.apache.org>.
Github user Gilbert88 commented on the issue:

    https://github.com/apache/mesos/pull/279
  
    Hi @m9a , thanks for the PR. While the issue is still in WIP stage. Would you mind to create a [JIRA-where people track on issues](https://issues.apache.org/jira/projects/MESOS) and post a patch at [reviewboard - required by Apache](http://reviews.apache.org/)? Usually, we only accept document PR or small change via Github. It was a practice from Apache. Sorry for the inconvenience.


---

[GitHub] mesos issue #279: WIP: Remove unknown unreachable tasks when agent re-regist...

Posted by xujyan <gi...@git.apache.org>.
Github user xujyan commented on the issue:

    https://github.com/apache/mesos/pull/279
  
    @Gilbert88: @m9a was having difficulty in with rbt so I told her to use a PR. Let's see how it goes first. @m9a have you created a JIRA and are you able to set the reviewer so it's clear to everyone?
    
    It's probably worth asking about the rbt issue on Mesos slack to see if others ran into it.


---

[GitHub] mesos issue #279: WIP: Remove unknown unreachable tasks when agent re-regist...

Posted by m9a <gi...@git.apache.org>.
Github user m9a commented on the issue:

    https://github.com/apache/mesos/pull/279
  
    @Gilbert88, @xujyan Sure, I will check about the rbt issue on slack.


---