Posted to yarn-issues@hadoop.apache.org by "lujie (JIRA)" <ji...@apache.org> on 2018/08/21 15:34:00 UTC

[jira] [Assigned] (YARN-8649) Similar as YARN-4355:NPE while processing localizer heartbeat

     [ https://issues.apache.org/jira/browse/YARN-8649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lujie reassigned YARN-8649:
---------------------------

      Assignee: lujie
    Attachment: YARN-8649.patch

Hi [~jlowe], [~pradeepambati],[~$iddhe$h]

I have re-examined the bug based on the logs.

*The root cause:*
 # When the NM shuts down, it sends KILL_CONTAINER to the container. The log shows this event:

{code:java}
2018-08-21 20:11:08,316 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1534853453424_0001_01_000001 transitioned from LOCALIZING to KILLING
{code}
This causes KillBeforeRunningTransition to execute.
 # KillBeforeRunningTransition calls "container.cleanup()", and "cleanup" sends a "ContainerLocalizationCleanupEvent".
 # ContainerLocalizationCleanupEvent causes ResourceLocalizationService.handleCleanupContainerResources to execute, and "handleCleanupContainerResources" sends a "ResourceReleaseEvent".
 # ResourceReleaseEvent causes LocalResourcesTrackerImpl.handle to execute, and handle (at line 199 in the source code) calls removeResource:

{code:java}
if (event.getType() == ResourceEventType.RELEASE) {
    if (rsrc.getState() == ResourceState.DOWNLOADING &&
        rsrc.getRefCount() <= 0 &&
        rsrc.getRequest().getVisibility() != LocalResourceVisibility.PUBLIC) {
        removeResource(req);
    }
}
{code}



 # removeResource then does:

{code:java}
LocalizedResource rsrc = localrsrc.remove(req);
{code}

 # When a heartbeat comes in, LocalResourcesTrackerImpl.getPathForLocalization does:

{code:java}
Path localPath = new Path(rPath, req.getPath().getName());
LocalizedResource rsrc = localrsrc.get(req); // rsrc is null: the entry was removed in the previous step
rsrc.setLocalPath(localPath); // NPE
{code}
NPE happens!
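
The sequence above can be reduced to a small standalone sketch. It is only an illustration: LocalizerRaceSketch, Resource, request, release and reproduces are hypothetical stand-ins, not the real NodeManager classes. Only the "remove from the map, then look up and dereference" pattern is taken from the snippets above, and the string key and paths are made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for the tracker; only the map handling is modeled.
class LocalizerRaceSketch {
    // Stand-in for LocalizedResource: holds only the local path.
    static final class Resource {
        private String localPath;
        void setLocalPath(String p) { localPath = p; }
        String getLocalPath() { return localPath; }
    }

    // Stand-in for the localrsrc map, keyed here by a plain string request.
    private final Map<String, Resource> localrsrc = new ConcurrentHashMap<>();

    // A container asks for the resource; it is now tracked.
    void request(String req) { localrsrc.put(req, new Resource()); }

    // Models the RELEASE path: removeResource(req) drops the entry.
    void release(String req) { localrsrc.remove(req); }

    // Models the unguarded lookup shown in the report.
    String getPathForLocalization(String req, String rPath) {
        Resource rsrc = localrsrc.get(req);   // null once release() has run
        rsrc.setLocalPath(rPath + "/" + req); // throws NullPointerException
        return rsrc.getLocalPath();
    }

    // Replays the sequence from the report; returns true if the NPE is observed.
    static boolean reproduces() {
        LocalizerRaceSketch tracker = new LocalizerRaceSketch();
        tracker.request("job.jar");
        tracker.release("job.jar"); // container killed while LOCALIZING
        try {
            tracker.getPathForLocalization("job.jar", "/tmp/nm-local-dir"); // late heartbeat
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + reproduces());
    }
}
```

Running main prints "NPE reproduced: true". In the real NodeManager the release and the heartbeat arrive on different paths, so the ordering is a matter of timing; the sketch simply forces the bad ordering.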

*Unit test:*


The patch for YARN-4355 added the test "testLocalizerHeartbeatWhenAppCleaningUp" to the class "TestResourceLocalizationService".

That test also sends the "ContainerLocalizationCleanupEvent", but it does not cover the case where a heartbeat arrives at that moment.

In this patch, we change "testLocalizerHeartbeatWhenAppCleaningUp" to cover this situation. Without the fix, the changed test triggers the bug.

 

*Fixing:*

To fix the NPE, we only add a null check; I think that is suitable here.
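
As a rough illustration of what such a null check could look like (this is not the content of YARN-8649.patch), here is a standalone sketch. NullCheckedTrackerSketch, Resource, request and release are hypothetical stand-ins, and returning null to the caller when the resource is gone is an assumption made for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for the tracker with a guarded lookup.
class NullCheckedTrackerSketch {
    // Stand-in for LocalizedResource: holds only the local path.
    static final class Resource {
        private String localPath;
        void setLocalPath(String p) { localPath = p; }
        String getLocalPath() { return localPath; }
    }

    // Stand-in for the localrsrc map, keyed here by a plain string request.
    private final Map<String, Resource> localrsrc = new ConcurrentHashMap<>();

    void request(String req) { localrsrc.put(req, new Resource()); }

    // Models the RELEASE path that removes the entry.
    void release(String req) { localrsrc.remove(req); }

    // Guarded lookup: returns null (an assumption of this sketch) when the
    // resource was already released, instead of dereferencing a null entry.
    String getPathForLocalization(String req, String rPath) {
        Resource rsrc = localrsrc.get(req);
        if (rsrc == null) {
            // The resource was removed, e.g. because its container was killed.
            return null;
        }
        rsrc.setLocalPath(rPath + "/" + req);
        return rsrc.getLocalPath();
    }

    public static void main(String[] args) {
        NullCheckedTrackerSketch tracker = new NullCheckedTrackerSketch();
        tracker.request("job.jar");
        System.out.println(tracker.getPathForLocalization("job.jar", "/tmp/nm-local-dir"));
        tracker.release("job.jar");
        // A late heartbeat no longer throws; the caller has to handle null.
        System.out.println(tracker.getPathForLocalization("job.jar", "/tmp/nm-local-dir"));
    }
}
```

With this guard, a heartbeat that arrives after the release gets null back, so the calling code must be prepared to skip that resource.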

> Similar as YARN-4355:NPE while processing localizer heartbeat
> -------------------------------------------------------------
>
>                 Key: YARN-8649
>                 URL: https://issues.apache.org/jira/browse/YARN-8649
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Major
>         Attachments: YARN-8649.patch, hadoop-hires-nodemanager-hadoop11.log
>
>
> I have noticed that a nodemanager was getting NPEs while tearing down. The reason may be similar to YARN-4355, which was reported by Jason Lowe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
