Posted to yarn-issues@hadoop.apache.org by "Bibin A Chundatt (JIRA)" <ji...@apache.org> on 2018/11/05 10:24:00 UTC
[jira] [Comment Edited] (YARN-8933) [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response

    [ https://issues.apache.org/jira/browse/YARN-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674898#comment-16674898 ] 

Bibin A Chundatt edited comment on YARN-8933 at 11/5/18 10:23 AM:
------------------------------------------------------------------

Thank you [~botong] for the patch.

Overall approach looks good to me.

The patch moves the 1 min timeout from LocalityMulticastAMRMProxyPolicy to FederationInterceptor and caches the last response, so all the policies should be able to take advantage of the same.
  
One concern: what happens if, due to a long GC pause on the AM side, the AM doesn't send a heartbeat for one minute? As per the current implementation, it will then never send an allocate request to the secondary subclusters, right?

To evaluate the timeout of a subcluster, we should also consider the time of the last allocate/heartbeat from the AM.
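
To illustrate the idea, here is a rough sketch. It is not the FederationInterceptor code from the patch: the class name, method names, and the choice to shift the timers by the AM's idle gap are all hypothetical, and it assumes that sub-clusters are only heartbeated when the AM itself heartbeats.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the actual FederationInterceptor code. */
class SubClusterExpirySketch {
  private final long timeoutMs;
  // Time of the last successful allocate response, per sub-cluster id.
  private final Map<String, Long> lastResponseMs = new HashMap<>();
  // Time of the last allocate/heartbeat received from the AM.
  private long lastAmHeartbeatMs;

  SubClusterExpirySketch(long timeoutMs, long startMs) {
    this.timeoutMs = timeoutMs;
    this.lastAmHeartbeatMs = startMs;
  }

  /** Call on every allocate/heartbeat from the AM. */
  void onAmHeartbeat(long nowMs) {
    long gapMs = nowMs - lastAmHeartbeatMs;
    if (gapMs > timeoutMs) {
      // The AM itself was silent (e.g. a long GC pause). Assumption: no
      // sub-cluster was heartbeated during that gap, so shift the timers
      // forward instead of blaming the sub-clusters for the silence.
      lastResponseMs.replaceAll((id, t) -> t + gapMs);
    }
    lastAmHeartbeatMs = nowMs;
  }

  /** Call on every successful allocate response from a sub-cluster. */
  void onSubClusterResponse(String subClusterId, long nowMs) {
    lastResponseMs.put(subClusterId, nowMs);
  }

  boolean isExpired(String subClusterId, long nowMs) {
    Long last = lastResponseMs.get(subClusterId);
    // Sub-clusters with no recorded response are not treated as expired here.
    return last != null && nowMs - last > timeoutMs;
  }
}
```

With this shape, a sub-cluster that had already timed out before the AM pause stays expired afterwards, while one that was healthy right before the pause gets its remaining window back.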

Also, could you add a test to verify the recovery case with LocalityMulticastAMRMProxyPolicy?


was (Author: bibinchundatt):
Thank you [~botong] for the patch.

Overall approach looks good to me.

The patch moves the 1 min timeout from LocalityMulticastAMRMProxyPolicy to FederationInterceptor and caches the last response, so all the policies should be able to take advantage of the same.

One concern: what happens if, due to a long GC pause on the AM side, the AM doesn't send a heartbeat for one minute? As per the current implementation, it will then never send an allocate request to the secondary subclusters, right?

To evaluate the timeout of a subcluster, we should also consider the time of the last allocate/heartbeat from the AM.

Also, could you add a test to verify the recovery case with LocalityMulticastAMRMProxyPolicy? I think {{AllocationBookkeeper#activeAndEnabledSC}} will always be empty.

> [AMRMProxy] Fix potential empty AvailableResource and NumClusterNode in allocation response
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-8933
>                 URL: https://issues.apache.org/jira/browse/YARN-8933
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: amrmproxy, federation
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Major
>         Attachments: YARN-8933.v1.patch, YARN-8933.v2.patch
>
>
> After YARN-8696, the allocate response returned by FederationInterceptor is merged from the responses of a random subset of all sub-clusters, depending on the async heartbeat timing. As a result, cluster-wide information fields in the response, e.g. AvailableResources and NumClusterNodes, are not consistent at all. They can even be null/zero when a given response is merged from an empty set of sub-cluster responses.
> In this patch, we let FederationInterceptor remember the last allocate response from all known sub-clusters, and always construct the cluster-wide info fields from all of them. We also moved sub-cluster timeout from LocalityMulticastAMRMProxyPolicy to FederationInterceptor, so that sub-clusters that expired (haven't had a successful allocate response for a while) won't be included in the computation.
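
The approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class, its fields, and the use of plain numbers instead of YARN's resource types are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of merging cluster-wide fields from cached responses. */
class ClusterInfoMergeSketch {
  /** Cluster-wide fields remembered from one sub-cluster's last response. */
  private static final class LastResponse {
    final long availableMemoryMb;
    final int numNodes;
    final long receivedAtMs;

    LastResponse(long availableMemoryMb, int numNodes, long receivedAtMs) {
      this.availableMemoryMb = availableMemoryMb;
      this.numNodes = numNodes;
      this.receivedAtMs = receivedAtMs;
    }
  }

  private final long timeoutMs;
  private final Map<String, LastResponse> lastResponses = new HashMap<>();

  ClusterInfoMergeSketch(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Remember the latest successful allocate response of a sub-cluster. */
  void remember(String subClusterId, long availableMemoryMb, int numNodes,
      long nowMs) {
    lastResponses.put(subClusterId,
        new LastResponse(availableMemoryMb, numNodes, nowMs));
  }

  /** Sum available memory over all sub-clusters that have not expired. */
  long totalAvailableMemoryMb(long nowMs) {
    long total = 0;
    for (LastResponse r : lastResponses.values()) {
      if (nowMs - r.receivedAtMs <= timeoutMs) {
        total += r.availableMemoryMb;
      }
    }
    return total;
  }

  /** Sum node counts over all sub-clusters that have not expired. */
  int totalNumNodes(long nowMs) {
    int total = 0;
    for (LastResponse r : lastResponses.values()) {
      if (nowMs - r.receivedAtMs <= timeoutMs) {
        total += r.numNodes;
      }
    }
    return total;
  }
}
```

Because the totals are always computed from every cached, non-expired entry, they no longer depend on which sub-cluster responses happen to arrive in a given heartbeat.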


