Posted to yarn-dev@hadoop.apache.org by "Tao Yang (JIRA)" <ji...@apache.org> on 2017/03/28 11:26:41 UTC

[jira] [Created] (YARN-6403) Invalid local resource request can raise NPE and make NM exit

Tao Yang created YARN-6403:
------------------------------

             Summary: Invalid local resource request can raise NPE and make NM exit
                 Key: YARN-6403
                 URL: https://issues.apache.org/jira/browse/YARN-6403
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.8.0
            Reporter: Tao Yang


We recently hit this problem in our testing environment. The application that triggered it added an invalid local resource request (one with no location) to its ContainerLaunchContext, like this:
{code}
    localResources.put("test", LocalResource.newInstance(location,
        LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
        System.currentTimeMillis()));
    ContainerLaunchContext amContainer =
        ContainerLaunchContext.newInstance(localResources, environment,
          vargsFinal, null, securityTokens, acls);
{code}

The actual value of location was null, although the application did not expect that. This mistake caused several NMs to exit with the NPE below, and they could not restart until the NM recovery dirs were deleted.
{code}
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
        at java.lang.Thread.run(Thread.java:745)
{code}

The NPE occurred while creating the LocalResourceRequest instance for the invalid resource request:
{code}
  public LocalResourceRequest(LocalResource resource)
      throws URISyntaxException {
    this(resource.getResource().toPath(),  // NPE occurred here: getResource() returned null
        resource.getTimestamp(),
        resource.getType(),
        resource.getVisibility(),
        resource.getPattern());
  }
{code}

We cannot currently guarantee that every local resource request is valid, but we can keep an invalid one from damaging the cluster. Perhaps we could validate the resource in both ContainerLaunchContext and LocalResourceRequest? Suggestions are welcome.
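
As a rough illustration of the kind of check suggested above, here is a minimal, self-contained sketch. It does not use the Hadoop classes: Resource, InvalidResourceException, validate and validateAll are hypothetical stand-ins invented for this example, and a real fix would have to decide where in the NM and client code the check belongs. The idea is only that a missing location is rejected with a handled exception instead of an NPE that reaches the dispatcher thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: every type here is a hypothetical stand-in, not a Hadoop class.
final class LocalResourceValidationSketch {

    /** Simplified stand-in for a local resource; only the location matters here. */
    static final class Resource {
        final String location; // null when the application built the request incorrectly

        Resource(String location) {
            this.location = location;
        }
    }

    /** Checked exception, so callers must handle a bad request explicitly. */
    static final class InvalidResourceException extends Exception {
        InvalidResourceException(String message) {
            super(message);
        }
    }

    /** Rejects a single resource that is null or has no location. */
    static void validate(String name, Resource resource) throws InvalidResourceException {
        if (resource == null) {
            throw new InvalidResourceException("local resource '" + name + "' is null");
        }
        if (resource.location == null || resource.location.isEmpty()) {
            throw new InvalidResourceException("local resource '" + name + "' has no location");
        }
    }

    /** Validates every entry of a local resource map before it is accepted. */
    static void validateAll(Map<String, Resource> localResources)
            throws InvalidResourceException {
        for (Map.Entry<String, Resource> entry : localResources.entrySet()) {
            validate(entry.getKey(), entry.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, Resource> localResources = new LinkedHashMap<>();
        localResources.put("test", new Resource(null)); // mirrors the null location above

        try {
            validateAll(localResources);
            System.out.println("accepted");
        } catch (InvalidResourceException e) {
            // Only this request fails; the process keeps running.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Running the same check at submission time and again before the request object is constructed would mean a bad request is refused early, and a request that slips through still fails only its own container.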



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
