Posted to dev@reef.apache.org by "Ignacio Cano (JIRA)" <ji...@apache.org> on 2015/08/13 18:36:45 UTC

[jira] [Updated] (REEF-589) REEF crashes when new nodes are added to the clusters dynamically

     [ https://issues.apache.org/jira/browse/REEF-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Cano updated REEF-589:
------------------------------
    Description: 
When trying to use REEF with Federation, we found a problem in the REEF resource catalog.
If an admin adds a new node to the cluster dynamically and future allocations are made on that node, REEF crashes because it cannot find that node in the catalog.
Although we found this problem using YARN, the same will happen with other RMs.

  was:
When trying to use REEF with Federation, there's a problem with the node reports YARN sends us.
Just after initializing our YARN client library (hadoop-yarn-client-2.4.0), we ask for the RUNNING nodes in the cluster to populate our own Resource Catalog.
YARN replies with the nodes that belong to a 'random' sub-cluster: sometimes the nodes in the correct sub-cluster (where the containers will be placed), and sometimes the nodes in a different one. That causes the application to fail randomly.
For example, we populate our resource catalog with nodes in sub-cluster 1, but the allocations are actually made on sub-cluster 2, so we fail.

We need a workaround for this issue, as the YARN folks are not sure when they will have this right.


> REEF crashes when new nodes are added to the clusters dynamically
> -----------------------------------------------------------------
>
>                 Key: REEF-589
>                 URL: https://issues.apache.org/jira/browse/REEF-589
>             Project: REEF
>          Issue Type: Task
>            Reporter: Ignacio Cano
>            Assignee: Ignacio Cano
>            Priority: Minor
>
> When trying to use REEF with Federation, we found a problem in the REEF resource catalog.
> If an admin adds a new node to the cluster dynamically and future allocations are made on that node, REEF crashes because it cannot find that node in the catalog.
> Although we found this problem using YARN, the same will happen with other RMs.
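
To illustrate the failure mode described in this issue, here is a minimal Java sketch. It is not REEF's actual resource catalog or the fix that was applied; `NodeCatalog`, `getStrict`, `getOrRegister`, and the rack names are hypothetical names chosen for illustration. It contrasts a strict lookup, which fails for a node that joined the cluster after the catalog was populated, with a tolerant lookup that registers the unseen node on first use.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a resource catalog; not REEF's actual class. */
final class NodeCatalog {
  // Maps a node id to the rack it is assumed to live on.
  private final Map<String, String> nodes = new ConcurrentHashMap<>();

  /** Registers a node, e.g. one reported as RUNNING at start-up. */
  void register(final String nodeId, final String rackName) {
    nodes.put(nodeId, rackName);
  }

  /** Strict lookup: throws for any node that was not registered beforehand. */
  String getStrict(final String nodeId) {
    final String rack = nodes.get(nodeId);
    if (rack == null) {
      throw new IllegalStateException("Unknown node: " + nodeId);
    }
    return rack;
  }

  /** Tolerant lookup: registers an unseen node under the given default rack. */
  String getOrRegister(final String nodeId, final String defaultRack) {
    return nodes.computeIfAbsent(nodeId, id -> defaultRack);
  }

  /** Returns true if the node is currently in the catalog. */
  boolean contains(final String nodeId) {
    return nodes.containsKey(nodeId);
  }
}
```

With this sketch, an allocation on a node added after start-up would fail in `getStrict` but succeed in `getOrRegister`, after which the node is known to the catalog. Whether a default rack is an acceptable placeholder depends on how the catalog is used for locality decisions, which this sketch does not address.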



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)