Posted to yarn-issues@hadoop.apache.org by "Xuan Gong (JIRA)" <ji...@apache.org> on 2013/08/14 20:27:48 UTC

[jira] [Commented] (YARN-867) Isolation of failures in aux services

    [ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740009#comment-13740009 ] 

Xuan Gong commented on YARN-867:
--------------------------------

My proposal:
When any auxService fails, instead of simply throwing the exception to the dispatcher, we will catch it and inform the AM.

Here is how it works:

We will use ContainerManagementProtocol. Basically, the AM will need to send an AuxiliaryServiceCheckRequest with the ApplicationId as a parameter periodically (we can set the period to 3s or 5s), and we use ContainerManagementProtocol to send this request to every ContainerManager that this AM knows about. Each ContainerManager will then send back a response saying whether any AuxiliaryService has failed for this appId, along with the related diagnostics.
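A rough sketch of the AM-side polling described above, in plain Java. Everything here is hypothetical and only illustrates the idea: AuxCheckClient stands in for the per-node ContainerManagementProtocol proxy, AuxCheckResponse for the proposed response, and AuxServicePoller is an invented helper. None of these names are existing YARN APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical response shape: a failure flag plus diagnostics.
class AuxCheckResponse {
    final boolean failed;
    final String diagnostics;

    AuxCheckResponse(boolean failed, String diagnostics) {
        this.failed = failed;
        this.diagnostics = diagnostics;
    }
}

// Hypothetical stand-in for the per-node ContainerManagementProtocol proxy.
interface AuxCheckClient {
    AuxCheckResponse check(String appId);
}

// Hypothetical AM-side helper that polls every known ContainerManager.
class AuxServicePoller {
    private final String appId;
    private final Map<String, AuxCheckClient> nodes; // node name -> client

    AuxServicePoller(String appId, Map<String, AuxCheckClient> nodes) {
        this.appId = appId;
        this.nodes = nodes;
    }

    // One polling round: ask each node and keep the diagnostics of those reporting a failure.
    Map<String, String> pollOnce() {
        Map<String, String> failed = new LinkedHashMap<>();
        for (Map.Entry<String, AuxCheckClient> e : nodes.entrySet()) {
            AuxCheckResponse r = e.getValue().check(appId);
            if (r.failed) {
                failed.put(e.getKey(), r.diagnostics);
            }
        }
        return failed;
    }

    // Repeat the round at a fixed period (e.g. 3 or 5 seconds); the caller shuts the executor down.
    ScheduledExecutorService start(long periodSeconds, Consumer<Map<String, String>> onFailure) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            Map<String, String> failed = pollOnce();
            if (!failed.isEmpty()) {
                onFailure.accept(failed);
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

In this sketch the AM would call start(3, handler) once after registering and decide in the handler how to react to a reported failure.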

On the ContainerManagerImpl side, for all the registered AuxServices, if any of them fails, instead of simply throwing the exception to the dispatcher, we will catch it and save the appId and exception message into an AuxServiceFailureMap. Then, when a ContainerManager receives an AuxiliaryServiceCheckRequest, it can look up the appId in the AuxServiceFailureMap and send back a response saying whether any AuxService has failed for this appId.
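A minimal sketch of the NM-side bookkeeping, assuming the failure map is keyed by the appId as a string and stores one diagnostic message per failure. The class and its methods (runIsolated, hasFailure, diagnostics) are made up for illustration and are not the attached patch or existing YARN code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the proposed AuxServiceFailureMap.
class AuxServiceFailureMap {
    // appId -> diagnostic messages of aux-service failures seen for that app
    private final Map<String, List<String>> failures = new ConcurrentHashMap<>();

    // Run one aux-service callback; record a failure instead of letting it reach the dispatcher.
    void runIsolated(String appId, String serviceName, Runnable serviceCall) {
        try {
            serviceCall.run();
        } catch (Exception e) {
            failures.computeIfAbsent(appId, k -> new CopyOnWriteArrayList<>())
                    .add(serviceName + ": " + e.getMessage());
        }
    }

    // What the handler of AuxiliaryServiceCheckRequest would consult.
    boolean hasFailure(String appId) {
        return failures.containsKey(appId);
    }

    // Copy of the recorded diagnostics for this app (empty if none).
    List<String> diagnostics(String appId) {
        return new ArrayList<>(failures.getOrDefault(appId, Collections.<String>emptyList()));
    }
}
```

Usage would look like map.runIsolated(appId, "shuffle", () -> service.initApp(...)), where the service call is whatever the NM currently invokes directly from the dispatcher.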

Sample code for this proposal is attached.
                
> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>
> Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results in any non-IOException will cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira