Posted to dev@reef.apache.org by "John Yang (JIRA)" <ji...@apache.org> on 2014/11/14 11:32:35 UTC

[jira] [Updated] (REEF-42) Extra YARN container causing unexpected memory reservations

     [ https://issues.apache.org/jira/browse/REEF-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Yang updated REEF-42:
--------------------------
    Description: 
Our cluster has 4 nodes, each with 13.67GB of memory. When we launch Surf (a long-running job) with 4 evaluators (7GB each), the available memory becomes 6.67GB x 3 nodes and 5.67GB x 1 node (the AM takes 1GB). However, an extra 7GB container request hangs at the RM for the following reasons.

# Because Surf is a long-running job, the evaluators that have already been allocated do not exit to make room for the extra container. If there were room, REEF would be notified of the allocation of the extra container and would release it right away (see the sketch after this list).
# To avoid YARN-314, we currently never send a 0-container request, which is what would remove the hanging extra request.
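
For reference, here is a minimal sketch of the release path from the first point, assuming the stock AMRMClientAsync callback API. The class name and the {{expectedEvaluators}} counter are hypothetical stand-ins for REEF's own bookkeeping, not our actual driver code:

{code:java}
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Hypothetical handler: any container allocated beyond what was requested is
// released right away so the RM can hand its memory back to the cluster.
final class ExtraContainerHandler implements AMRMClientAsync.CallbackHandler {

  private final AtomicInteger expectedEvaluators; // e.g. 4 for our Surf setup
  private AMRMClientAsync<?> amRmClient;          // set once the client is created with this handler

  ExtraContainerHandler(final int expectedEvaluators) {
    this.expectedEvaluators = new AtomicInteger(expectedEvaluators);
  }

  void setClient(final AMRMClientAsync<?> amRmClient) {
    this.amRmClient = amRmClient;
  }

  @Override
  public void onContainersAllocated(final List<Container> containers) {
    for (final Container container : containers) {
      if (this.expectedEvaluators.getAndDecrement() > 0) {
        // Expected allocation: hand the container to the driver and launch an evaluator.
      } else {
        // Extra container: give it back immediately.
        this.amRmClient.releaseAssignedContainer(container.getId());
      }
    }
  }

  @Override public void onContainersCompleted(final List<ContainerStatus> statuses) { }
  @Override public void onShutdownRequest() { }
  @Override public void onNodesUpdated(final List<NodeReport> nodeReports) { }
  @Override public void onError(final Throwable e) { }
  @Override public float getProgress() { return 0.0f; }
}
{code}

The problem described above is precisely that this path is never reached: the extra ask can never be satisfied, so no allocation callback fires and the request keeps reserving memory.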

As a result, the RM keeps trying to allocate the hanging request, reserving 7GB on each node, so the *Memory Reserved* metric increases and the *Memory Available* metric decreases.

The same thing happens when we explicitly request more than the cluster can hold, say 8GB x 5 evaluators: only one 8GB container fits on a 13.67GB node, so at most 4 of the 5 requests can ever be placed. The difference is that the reservation caused by the extra container is unpredictable.

[~chobrian] and I discussed the tradeoff between the following options.
# Send 0-container requests and address YARN-314 differently, either by adding another layer of indirection atop AMRMClient or by replacing it altogether
# Wait until YARN-314 is resolved, since our case is not common and can be discovered and fixed by the system administrator

We think the second approach is better. Once YARN-314 is resolved, I'll create a patch that allows sending 0-container requests, roughly along the lines of the sketch below.
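
To make that concrete, the patch could look roughly like this, assuming the standard AMRMClient API: removing the last outstanding ContainerRequest for a given capability and priority should make the client's next heartbeat carry a 0-container ResourceRequest, which tells the RM to drop the pending ask. The class and method names here are hypothetical; only the YARN calls are real.

{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Hypothetical helper for withdrawing an ask we no longer want. The request
// passed to removeContainerRequest should mirror one previously registered
// via addContainerRequest (same capability, priority and locality).
final class AskCanceller {

  private final AMRMClientAsync<ContainerRequest> amRmClient;

  AskCanceller(final AMRMClientAsync<ContainerRequest> amRmClient) {
    this.amRmClient = amRmClient;
  }

  void cancelEvaluatorAsk(final int memoryMB, final int cores, final int priority) {
    final ContainerRequest request = new ContainerRequest(
        Resource.newInstance(memoryMB, cores),
        null,  // any node
        null,  // any rack
        Priority.newInstance(priority));
    // Decrementing the outstanding count to zero is what ultimately surfaces
    // as the 0-container request we currently avoid because of YARN-314.
    this.amRmClient.removeContainerRequest(request);
  }
}
{code}

In the Surf example, cancelling the phantom 7GB ask once the 4 expected evaluators are up would let the RM stop reserving memory for it.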

Any suggestions are welcome.


> Extra YARN container causing unexpected memory reservations
> -----------------------------------------------------------
>
>                 Key: REEF-42
>                 URL: https://issues.apache.org/jira/browse/REEF-42
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF-Runtime-YARN
>            Reporter: John Yang
>            Assignee: John Yang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)