You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Naganarasimha G R (JIRA)" <ji...@apache.org> on 2016/03/15 05:07:33 UTC

[jira] [Updated] (YARN-4711) NM is going down with NPE's due to single thread processing of events by Timeline client

     [ https://issues.apache.org/jira/browse/YARN-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Naganarasimha G R updated YARN-4711:
------------------------------------
    Attachment: 4711Analysis.txt

Hi [~sjlee0], 
Sorry for the long delay ! 
>From the analysis i am able to identify that it happens on any exception from the web server while publishing the entity. Earlier i suspected it might be due to the time taken to publish but other imp cause could be we are retrying to publish the entity irrespective of the exception type.
So basically {{ContainerManagerImpl.ContainerEventDispatcher.handle(ContainerEvent)}} -> {{nmMetricsPublisher.publishContainerEvent}} -> {{NMTimelinePublisher.ContainerEventHandler.handle(ContainerEvent)}}. So synchronous container metric event dispatching in {{NMTimelinePublisher.dispatcher}} is getting slowed down as {{TimelineClientImpl.putObjects}} is retrying on exception.
  


> NM is going down with NPE's due to single thread processing of events by Timeline client
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4711
>                 URL: https://issues.apache.org/jira/browse/YARN-4711
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>            Priority: Critical
>              Labels: yarn-2928-1st-milestone
>         Attachments: 4711Analysis.txt
>
>
> After YARN-3367, while testing the latest 2928 branch came across few NPEs due to which NM is shutting down.
> {code}
> 2016-02-21 23:19:54,078 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:306)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ContainerEventHandler.handle(NMTimelinePublisher.java:296)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:213)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerFinishedEvent(NMTimelinePublisher.java:192)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.access$400(NMTimelinePublisher.java:63)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:289)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher$ApplicationEventHandler.handle(NMTimelinePublisher.java:280)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> On analysis found that the there was delay in processing of events, as after YARN-3367 all the events were getting processed by a single thread inside the timeline client. 
> Additionally found one scenario where there is possibility of NPE:
> * TimelineEntity.toString() when {{real}} is not null



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)