Posted to yarn-issues@hadoop.apache.org by "Varun Vasudev (JIRA)" <ji...@apache.org> on 2016/04/07 16:52:26 UTC
[jira] [Comment Edited] (YARN-4876) [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop

    [ https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230349#comment-15230349 ] 

Varun Vasudev edited comment on YARN-4876 at 4/7/16 2:52 PM:
-------------------------------------------------------------

Thanks for the document [~asuresh]!

Here are my initial thoughts -

{code} Add int field 'destroyDelay' to each 'StartContainerRequest':{code}

I think we should avoid this for now. We should require that AMs that use initialize() must call destroy, and that AMs that call start with the ContainerLaunchContext can't call destroy. We can achieve that by adding the destroyDelay field you mentioned in your document but not allowing AMs to set it: if initialize is called, set destroyDelay internally to -1, else to 0. I'm not saying we should drop the feature, just that we should come back to it once we've sorted out the lifecycle from an initialize->destroy perspective.
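
A minimal sketch of the rule above, to make the intent concrete. The class and method names are hypothetical and do not exist in YARN, and reading -1 as "the AM must destroy explicitly" and 0 as "no separate destroy" is my interpretation of the two values:

```java
// Hypothetical sketch; none of these names exist in YARN.
final class DestroyDelayPolicy {
    /** Assumed meaning: the AM must call destroy explicitly. */
    static final int EXPLICIT_DESTROY = -1;
    /** Assumed meaning: no separate destroy call; cleanup follows stop. */
    static final int DESTROY_ON_STOP = 0;

    private DestroyDelayPolicy() {}

    /** Value the NM would assign internally; AMs are not allowed to set it. */
    static int destroyDelayFor(boolean createdViaInitialize) {
        return createdViaInitialize ? EXPLICIT_DESTROY : DESTROY_ON_STOP;
    }

    /** destroy() is only legal for containers that were created via initialize(). */
    static boolean mayCallDestroy(int destroyDelay) {
        return destroyDelay == EXPLICIT_DESTROY;
    }
}
```

With this shape the initialize/destroy and start/stop pairs stay separate, and the field can be opened up to AMs later without changing the record.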

{code}
Modify 'StopContainerRequest' Record:
  Add boolean 'destroyContainer':
{code}
Similar to above - let's avoid mixing initialize/destroy with start/stop for now.

{code}
• Introduce a new 'ContainerEventType.START_CONTAINER' event type.
• Introduce a new 'ContainerEventType.DESTROY_CONTAINER' event type.
• The Container remains in the LOCALIZED state until it receives the 'START_CONTAINER' event.
{code}

Can you add a state machine transition diagram to explain how new states and events affect each other?

{code}
If 'initializeContainer' with a new ContainerLaunchContext is called by the AM while the Container is RUNNING, it is treated as a KILL_CONTAINER event followed by a CONTAINER_RESOURCE_CLEANUP and an INIT_CONTAINER event to kick off re-localization, after which the Container will return to LOCALIZED state.
{code}
I'd really like to avoid this specific behavior. I think we should add an explicit re-initialize/re-localize API. For a running process, we ideally want to localize the upgraded bits while the container is still running and only then kill the existing process, to minimize downtime. For containers where localization can take a long time, forcing a kill followed by a re-initialize adds a serious amount of downtime. Re-initialize and initialize will probably end up behaving differently. On a similar note, I think we might have to introduce a new "re-initializing/re-localizing/running-localizing" state, which implies that a container is running but we are carrying out some background work.
In addition, I don't think we can do a cleanup of resources during an upgrade. For services that keep local state in the container work dir, we'd essentially be wiping away all the local state and forcing them to start from scratch.
Just a clarification: when you mentioned CONTAINER_RESOURCE_CLEANUP, I'm assuming you meant CLEANUP_CONTAINER_RESOURCES.
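
To illustrate the ordering I have in mind, here is a rough sketch. Everything in it is hypothetical: the state name RUNNING_LOCALIZING, the method names, and the logged steps are placeholders rather than existing NM code. The point is only that the old process is killed after localization finishes, and that no resource cleanup happens in between:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; state and method names are illustrative, not YARN APIs.
final class ReInitSketch {
    enum State { RUNNING, RUNNING_LOCALIZING }

    private State state = State.RUNNING;
    private final List<String> log = new ArrayList<>();

    /** AM asks for an upgrade: start localizing, keep the old process alive. */
    void reInitialize() {
        if (state != State.RUNNING) {
            throw new IllegalStateException("re-initialize requires RUNNING, was " + state);
        }
        state = State.RUNNING_LOCALIZING;
        log.add("localize new resources in background");
    }

    /** New bits are ready: only now kill the old process; the work dir is kept. */
    void onLocalizationDone() {
        if (state != State.RUNNING_LOCALIZING) {
            throw new IllegalStateException("no re-initialize in progress, was " + state);
        }
        log.add("kill old process");
        log.add("launch new process (no resource cleanup)");
        state = State.RUNNING;
    }

    State state() { return state; }

    List<String> log() { return log; }
}
```

Downtime in this sketch is limited to the gap between the kill and the relaunch, rather than the whole localization period.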

{code}
• If 'initializeContainer' is called WITHOUT a new ContainerLaunchContext by the AM, it is considered a restart, and will follow the same code path as 'initializeContainer' with new ContainerLaunchContext, but will not perform a CONTAINER_RESOURCE_CLEANUP and INIT_CONTAINER. The Container process will be killed and the container will be returned to LOCALIZED state.
• If 'startContainer' is called WITHOUT a new ContainerLaunchContext by the AM, it is treated exactly as the above case, but it will also trigger a START_CONTAINER event.
{code}
Instead of forcing AMs to make two calls, why don't we just add a restart API that does everything you've outlined above? It's cleaner and we don't have to do as many condition checks. In addition, with a restart API we can do stuff like allowing AMs to specify a delay, or some conditions when the restart should happen.
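
As a rough idea of what such a request could look like (RestartRequest and its methods are hypothetical, not an existing YARN record; the two event names are the ones from your document):

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; RestartRequest is not an existing YARN record.
final class RestartRequest {
    private final String containerId;
    private final Duration delay;

    private RestartRequest(String containerId, Duration delay) {
        if (delay.isNegative()) {
            throw new IllegalArgumentException("delay must not be negative");
        }
        this.containerId = containerId;
        this.delay = delay;
    }

    /** Restart as soon as the NM processes the request. */
    static RestartRequest immediate(String containerId) {
        return new RestartRequest(containerId, Duration.ZERO);
    }

    /** Restart after an AM-specified delay. */
    static RestartRequest after(String containerId, Duration delay) {
        return new RestartRequest(containerId, delay);
    }

    String containerId() { return containerId; }

    Duration delay() { return delay; }

    /** Events a single restart call would trigger: no cleanup, no re-localization. */
    List<String> events() {
        return Arrays.asList("KILL_CONTAINER", "START_CONTAINER");
    }
}
```

One call then covers what would otherwise be an initializeContainer followed by a startContainer, and conditions other than a delay could be added as further fields.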


> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-4876-design-doc.pdf
>
>
> Introduce *initialize* and *destroy* container API into the *ContainerManagementProtocol* and decouple the actual start of a container from the initialization. This will allow AMs to re-start a container without having to lose the allocation.
> Additionally, if the localization of the container is associated with the initialize (and the cleanup with the destroy), this can also be used by applications to upgrade a Container by *re-initializing* with a new *ContainerLaunchContext*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)