Posted to dev@reef.apache.org by "Ignacio Cano (JIRA)" <ji...@apache.org> on 2015/08/13 01:59:45 UTC

[jira] [Updated] (REEF-568) Work around the federated YARN container allocation problem

     [ https://issues.apache.org/jira/browse/REEF-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Cano updated REEF-568:
------------------------------
    Description: 
When trying to use REEF with Federation, there is a problem with the node reports YARN sends us.
Just after initializing our YARN client library (hadoop-yarn-client-2.4.0), we ask for the RUNNING nodes in the cluster to populate our own Resource Catalog.
YARN replies with the nodes of a 'random' sub-cluster: sometimes the nodes of the correct sub-cluster (where the AM was placed), and sometimes those of another one.
That causes the application to fail randomly.
For example, we populate the catalog with the nodes in sub-cluster 1, but the allocations are actually made on sub-cluster 2, so we fail.

We need a workaround for this issue, because the YARN folks are not sure when they will have the right fix.
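
To make the failure mode concrete, here is a minimal sketch in plain Java. It is not REEF code: the class ResourceCatalogSketch, its methods, the node names, and the rack names are all hypothetical, and the tolerant lookup is only one possible workaround under the assumption that an unknown node can be registered when it is first seen.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a resource catalog; not the real REEF class. */
final class ResourceCatalogSketch {
  // Node id -> rack name, as learned from the initial node report.
  private final Map<String, String> rackByNode = new HashMap<>();

  /** Populates the catalog from the node ids reported as RUNNING. */
  void populate(final List<String> reportedNodeIds, final String rack) {
    for (final String nodeId : reportedNodeIds) {
      rackByNode.put(nodeId, rack);
    }
  }

  /** Strict lookup: empty when the allocation landed on an unreported node. */
  Optional<String> lookup(final String nodeId) {
    return Optional.ofNullable(rackByNode.get(nodeId));
  }

  /** Tolerant lookup (assumed workaround): registers an unknown node on first sight. */
  String lookupOrRegister(final String nodeId, final String defaultRack) {
    return rackByNode.computeIfAbsent(nodeId, id -> defaultRack);
  }

  public static void main(final String[] args) {
    final ResourceCatalogSketch catalog = new ResourceCatalogSketch();
    // The node report came from sub-cluster 1 ...
    catalog.populate(Arrays.asList("sc1-node-1", "sc1-node-2"), "/sub-cluster-1");
    // ... but the container was allocated on a node in sub-cluster 2.
    System.out.println("strict lookup found node: " + catalog.lookup("sc2-node-7").isPresent());
    System.out.println("tolerant lookup rack: " + catalog.lookupOrRegister("sc2-node-7", "/default-rack"));
  }
}
```

The strict lookup reproduces the mismatch described above; the tolerant variant trades rack accuracy for not failing, which may or may not be acceptable.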

  was:
We need a workaround for the federated YARN container allocation issue.
The YARN client library calls the onError callback on YarnContainerManager with a Throwable. Either we inspect the Throwable, identify the error (multiple allocations for one request), and ignore it silently, or we will have to "override" the class in the YARN client library that makes that call.

As Markus suggested, we can guard our onError with a configuration parameter.
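
A rough sketch of what such a guarded onError could look like, in plain Java. Everything here is hypothetical: the class GuardedErrorHandlerSketch, the boolean flag standing in for the configuration parameter, and in particular the marker text used to recognize the error, since the actual message of the Throwable is not given in this issue.

```java
import java.util.function.Consumer;

/** Hypothetical handler; the class, flag, and marker text are illustrative only. */
final class GuardedErrorHandlerSketch {
  // Assumed marker; the real error message from the YARN client is not known here.
  static final String ASSUMED_MARKER = "multiple allocations";

  private final boolean ignoreFederationErrors; // stands in for a configuration parameter
  private final Consumer<Throwable> fatalHandler; // what we would normally do on error
  private int ignoredCount = 0;

  GuardedErrorHandlerSketch(final boolean ignoreFederationErrors,
                            final Consumer<Throwable> fatalHandler) {
    this.ignoreFederationErrors = ignoreFederationErrors;
    this.fatalHandler = fatalHandler;
  }

  /** Swallows the known error only when the guard is enabled; otherwise fails as before. */
  void onError(final Throwable throwable) {
    if (ignoreFederationErrors && isKnownFederationError(throwable)) {
      ignoredCount++;
      return;
    }
    fatalHandler.accept(throwable);
  }

  /** Inspects the Throwable's message for the assumed marker text. */
  static boolean isKnownFederationError(final Throwable throwable) {
    final String message = throwable.getMessage();
    return message != null && message.toLowerCase().contains(ASSUMED_MARKER);
  }

  int getIgnoredCount() {
    return ignoredCount;
  }
}
```

With the flag off, the handler behaves exactly as it does without the guard, so the default behavior stays unchanged for non-federated clusters.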



> Work around the federated YARN container allocation problem
> -----------------------------------------------------------
>
>                 Key: REEF-568
>                 URL: https://issues.apache.org/jira/browse/REEF-568
>             Project: REEF
>          Issue Type: Task
>            Reporter: Ignacio Cano
>            Assignee: Ignacio Cano
>            Priority: Minor
>
> When trying to use REEF with Federation, there is a problem with the node reports YARN sends us.
> Just after initializing our YARN client library (hadoop-yarn-client-2.4.0), we ask for the RUNNING nodes in the cluster to populate our own Resource Catalog.
> YARN replies with the nodes of a 'random' sub-cluster: sometimes the nodes of the correct sub-cluster (where the AM was placed), and sometimes those of another one.
> That causes the application to fail randomly.
> For example, we populate the catalog with the nodes in sub-cluster 1, but the allocations are actually made on sub-cluster 2, so we fail.
> We need a workaround for this issue, because the YARN folks are not sure when they will have the right fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)