Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/10/16 00:33:05 UTC

[jira] [Updated] (TEZ-2872) Tez AM can be overwhelmed by TezTaskUmbilicalProtocol.getTask responses

     [ https://issues.apache.org/jira/browse/TEZ-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated TEZ-2872:
----------------------------
    Attachment: TEZ-2872.gettask-governor.patch

Here's a crude patch that lets the client configure a maximum number of tasks to launch per second.  If a getTask() call would exceed that maximum, the AM simply returns null and expects the container to poll again later.
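
For illustration only, here is a minimal sketch of that kind of governor, assuming a fixed one-second window. It is not the code in the attached patch: the class name GetTaskGovernor, the maxTasksPerSecond parameter, and the injectable clock are all hypothetical.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of a per-second getTask governor.
 * Not taken from TEZ-2872.gettask-governor.patch.
 */
final class GetTaskGovernor {
    private final int maxTasksPerSecond;     // hypothetical configured limit
    private final LongSupplier clockMillis;  // injectable clock, eases testing
    private long windowStartMillis;          // start of the current 1s window
    private int launchedInWindow;            // tasks handed out in this window

    GetTaskGovernor(int maxTasksPerSecond, LongSupplier clockMillis) {
        this.maxTasksPerSecond = maxTasksPerSecond;
        this.clockMillis = clockMillis;
        this.windowStartMillis = clockMillis.getAsLong();
    }

    /**
     * Returns true if a task may be handed out now. A false result means
     * the caller should answer getTask() with null so the container
     * polls again later.
     */
    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        if (now - windowStartMillis >= 1000L) {
            // A full second has elapsed: start a fresh window.
            windowStartMillis = now;
            launchedInWindow = 0;
        }
        if (launchedInWindow >= maxTasksPerSecond) {
            return false;
        }
        launchedInWindow++;
        return true;
    }
}
```

A getTask() handler could check tryAcquire() before building the ContainerTask response and return null when it fails, so the large payload is never serialized for a request that is being deferred. A fixed window allows short bursts at window boundaries; a token bucket would smooth those out at the cost of a little more state.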


> Tez AM can be overwhelmed by TezTaskUmbilicalProtocol.getTask responses
> -----------------------------------------------------------------------
>
>                 Key: TEZ-2872
>                 URL: https://issues.apache.org/jira/browse/TEZ-2872
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jason Lowe
>         Attachments: TEZ-2872.gettask-governor.patch
>
>
> When a large job with a large user payload runs on a large cluster, the AM can end up hitting OOM conditions.  For example, Pig-on-Tez can require a significant user payload (approaching 1 MB) for vertices, inputs, and outputs in the DAG.  This can make the ContainerTask response for each task rather large, which can lead to a situation where the AM generates responses faster than the network interface can send them.  If enough containers are asking for tasks, the pending responses accumulate in memory and the AM hits an OOM condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)