Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2019/04/08 17:28:00 UTC

[jira] [Comment Edited] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

    [ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812550#comment-16812550 ] 

Vinod Kone edited comment on MESOS-6285 at 4/8/19 5:27 PM:
-----------------------------------------------------------

We already limit, in memory, the number of completed tasks per executor (200, not configurable), completed executors per framework (150, configurable) and completed frameworks (50, not configurable). I don't think there's much value in keeping metadata on disk for more tasks/executors/frameworks than these limits allow. If so, we need to figure out how to GC a task/executor/framework's on-disk metadata once it is evicted from the in-memory circular buffers / bounded hashmaps holding these.
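
To make the GC idea concrete, here is a minimal sketch of a bounded history that reports each entry it evicts, so the caller could delete the matching checkpointed metadata at that point. This is not agent code: {{BoundedTaskHistory}}, its eviction callback, and the {{replay}} helper are hypothetical names invented for illustration, and a real implementation would remove the task's directory on disk instead of collecting IDs.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch, not Mesos code: a bounded history of completed
// task IDs that reports each evicted ID so the caller can GC the
// matching on-disk metadata.
class BoundedTaskHistory
{
public:
  using EvictFn = std::function<void(const std::string&)>;

  BoundedTaskHistory(std::size_t capacity, EvictFn onEvict)
    : capacity_(capacity), onEvict_(std::move(onEvict))
  {
    assert(capacity_ > 0);
  }

  // Records a completed task. If the history is full, the oldest entry
  // is dropped first and handed to the eviction callback.
  void add(const std::string& taskId)
  {
    if (tasks_.size() == capacity_) {
      std::string oldest = std::move(tasks_.front());
      tasks_.pop_front();
      if (onEvict_) {
        onEvict_(oldest);
      }
    }
    tasks_.push_back(taskId);
  }

  std::size_t size() const { return tasks_.size(); }

private:
  std::size_t capacity_;
  EvictFn onEvict_;
  std::deque<std::string> tasks_;
};

// Replays `total` completed tasks (IDs "0", "1", ...) through a history
// of the given capacity and returns the IDs whose metadata would be
// GC'ed. Collecting IDs stands in for deleting files on disk.
inline std::vector<std::string> replay(std::size_t capacity, std::size_t total)
{
  std::vector<std::string> evicted;
  BoundedTaskHistory history(
      capacity,
      [&evicted](const std::string& id) { evicted.push_back(id); });

  for (std::size_t i = 0; i < total; ++i) {
    history.add(std::to_string(i));
  }
  return evicted;
}
```

With a capacity of 200, replaying 203 completed tasks evicts the three oldest IDs, so the metadata left on disk never exceeds what the agent would keep in memory anyway, and recovery only has to read a bounded number of tasks.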


was (Author: vinodkone):
We already limit the number of completed tasks per executor (200, not configurable) and completed executors per framework (150, configurable) in memory. I don't think there's much value in storing metadata information about more than these tasks/executors on the disk? If yes, we need to figure out how to GC a task/executor once it goes out of the in-memory circular buffers / bounded hashmaps holding these.

> Agents may OOM during recovery if there are too many tasks or executors
> -----------------------------------------------------------------------
>
>                 Key: MESOS-6285
>                 URL: https://issues.apache.org/jira/browse/MESOS-6285
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.1
>            Reporter: Joseph Wu
>            Priority: Critical
>              Labels: mesosphere
>
> On a test cluster, we encountered a degenerate case where running the example {{long-lived-framework}} for over a week would render the agent unrecoverable.
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and launches a single task on that executor every time it receives an offer from that agent.  Over the course of a week, the framework manages to launch some 400k tasks (short sleeps) on one executor.  During runtime, this is not problematic, as each completed task is quickly rotated out of the agent's memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, which makes recovery slow and often results in the agent being OOM-killed before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>    // Helper to launch a task using an offer.
>    void launch(const Offer& offer)
>    {
> -    int taskId = tasksLaunched++;
> -    ++metrics.tasks_launched;
> -
> -    TaskInfo task;
> -    task.set_name("Task " + stringify(taskId));
> -    task.mutable_task_id()->set_value(stringify(taskId));
> -    task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -    task.mutable_resources()->CopyFrom(taskResources);
> -    task.mutable_executor()->CopyFrom(executor);
> -
>      Call call;
>      call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>      Offer::Operation* operation = accept->add_operations();
>      operation->set_type(Offer::Operation::LAUNCH);
>  
> -    operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +    // Launch as many tasks as possible in the given offer.
> +    Resources remaining = Resources(offer.resources()).flatten();
> +    while (remaining.contains(taskResources)) {
> +      int taskId = tasksLaunched++;
> +      ++metrics.tasks_launched;
> +
> +      TaskInfo task;
> +      task.set_name("Task " + stringify(taskId));
> +      task.mutable_task_id()->set_value(stringify(taskId));
> +      task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +      task.mutable_resources()->CopyFrom(taskResources);
> +      task.mutable_executor()->CopyFrom(executor);
> +
> +      operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +      remaining -= taskResources;
> +    }
>  
>      mesos->send(call);
>    }
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  With this patch on a 1 CPU, 1 GB agent, it should take about 10 minutes to build up sufficient task launches.
> 3) Restart the agent and watch it flail during recovery.


