Posted to yarn-issues@hadoop.apache.org by "Vrushali C (JIRA)" <ji...@apache.org> on 2017/07/21 02:25:00 UTC

[jira] [Commented] (YARN-6323) Rolling upgrade/config change is broken on timeline v2.

    [ https://issues.apache.org/jira/browse/YARN-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095682#comment-16095682 ] 

Vrushali C commented on YARN-6323:
----------------------------------

Ping on this jira. To summarize:

- A new NM fails to recover apps because the timeline flow context is missing for old apps on the NM. This patch puts in a default flow context so the NM can proceed.

To answer Rohith's questions:

bq. Application is NOT submitted with tags. So default values are created by YARN.
RM creates default FlowContext with FlowName as appName. On NM restart, we are creating FlowContext with appId. So there will be inconsistencies when entities are published during rolling upgrade.
Yes, there would be inconsistencies, but it is not possible to upgrade the RM and all the NMs at exactly the same time unless we take a downtime.

bq. Assume that the application is submitted with some tags. RM recovers the application and starts publishing with tags as flow context. Again there are inconsistencies in the published entities.
Yes, but how would we synchronize the RM and NM across restarts? We could use the app id in both cases, but that makes for strange default data.

This patch will ensure the NM does not fail to start up. I thought of adding in some default values so that the data could be dropped, but that would be an expensive check to do each time we want to write to the backend.
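
For illustration only, here is a minimal sketch of the fallback idea, not the code in the attached patch. The class `FlowContextSketch`, its fields, and the `recover` and `defaultFor` helpers are hypothetical stand-ins rather than actual YARN classes. The only part taken from this thread is that the default flow name on the NM side is the app id; the default flow version and run id are assumptions.

```java
// Hypothetical stand-in for the per-application flow context kept by the NM.
// This is NOT the real YARN class; names and fields are assumptions.
final class FlowContextSketch {
    final String flowName;
    final String flowVersion;
    final long flowRunId;

    FlowContextSketch(String flowName, String flowVersion, long flowRunId) {
        this.flowName = flowName;
        this.flowVersion = flowVersion;
        this.flowRunId = flowRunId;
    }

    // Default context for a "left over" app that has no stored flow context.
    // Flow name = app id follows the thread; version "1" and the caller-supplied
    // run id are assumed placeholder defaults.
    static FlowContextSketch defaultFor(String appId, long fallbackRunId) {
        return new FlowContextSketch(appId, "1", fallbackRunId);
    }

    // Recovery path: use the stored context when present, otherwise fall back
    // to the default instead of failing NM startup.
    static FlowContextSketch recover(String appId, FlowContextSketch stored,
                                     long fallbackRunId) {
        return stored != null ? stored : defaultFor(appId, fallbackRunId);
    }
}
```

With this shape, an app recovered without a stored context still gets a usable (if inconsistent with the RM) flow name, while apps that already carry a context are left untouched.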

Ping [~rohithsharma] [~varun_saxena] [~haibo.chen]: any other ideas? At the very least, the NM can't be crashing during an upgrade due to a missing flow context.


> Rolling upgrade/config change is broken on timeline v2. 
> --------------------------------------------------------
>
>                 Key: YARN-6323
>                 URL: https://issues.apache.org/jira/browse/YARN-6323
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Li Lu
>            Assignee: Vrushali C
>              Labels: yarn-5355-merge-blocker
>         Attachments: YARN-6323.001.patch
>
>
> Found this issue when deploying on real clusters. If there are apps running when we enable timeline v2 (with work preserving restart enabled), node managers will fail to start due to missing app context data. We should probably assign some default names to these "left over" apps. I believe it's suboptimal to let users clean up the whole cluster before enabling timeline v2. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
