Posted to yarn-issues@hadoop.apache.org by "Arun Suresh (JIRA)" <ji...@apache.org> on 2016/06/14 20:56:30 UTC

[jira] [Comment Edited] (YARN-4876) [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop

    [ https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330496#comment-15330496 ] 

Arun Suresh edited comment on YARN-4876 at 6/14/16 8:56 PM:
------------------------------------------------------------

Aggregating and posting some design points on the patch based on offline discussions with [~marco.rabozzi] :

h4. ContainerImpl state machine
In the current patch, containers initialized through the new initializeContainers API wait in the LOCALIZED state, after resource localization, for a startContainers request. When a request from the application master generates the START_CONTAINER event, the container transitions to a new LAUNCHING state and waits for a CONTAINER_LAUNCHED event (fired asynchronously by ContainerLaunch when the container process is being started). Upon receiving the CONTAINER_LAUNCHED event, the container state is updated to RUNNING. For containers that do not allow multi-start (i.e. those that are initialized and started using the standard startContainers API), the START_CONTAINER event is sent automatically after localization.

The role of the new “LAUNCHING” state is to make a clear distinction between the following two situations:
# The container has been localized and is waiting for a start request (LOCALIZED state)
# The container has received a start request and it is being started (LAUNCHING state)
In this fashion, we can allow a start (or a restart) of an idle container only if the container is in the LOCALIZED state and if it allows multi-start. 

From a first analysis, it seems that the new LAUNCHING state and the existing RELAUNCHING state could be merged into a single LAUNCHING state to reduce the complexity of the state machine.

The destroyContainers API is equivalent to stopContainers if the specified containers do not allow multi-start. For a container that allows multi-start, on the other hand, the stopContainers API kills the container process and reverts the container state machine to “LOCALIZED”. To properly catch the termination of a container process for which a stop request has been issued, an additional “STOPPING” state has been inserted. If the container is in the RUNNING state and allows multi-start, the application master can issue a stopContainers request, upon which the container state is updated to STOPPING and an asynchronous request to kill the container process is sent. Within the STOPPING state, similarly to the KILLING state, the container termination events (CONTAINER_EXITED_WITH_SUCCESS, CONTAINER_KILLED_ON_REQUEST, CONTAINER_EXITED_WITH_FAILURE) are all treated as a successful container stop, upon which the container state reverts to LOCALIZED.
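
To make the transitions above easier to review, here is a minimal sketch of the multi-start path only. It is not the ContainerImpl code from the patch: the class name, the STOP_CONTAINER event name and the next() helper are assumptions made for illustration, and the existing states (KILLING, RELAUNCHING, etc.) are left out.

```java
// Simplified model of the multi-start transitions described in this comment.
// Not the real ContainerImpl state machine; STOP_CONTAINER is an assumed event name.
final class MultiStartStateSketch {
    enum State { LOCALIZED, LAUNCHING, RUNNING, STOPPING }

    enum Event {
        START_CONTAINER,
        CONTAINER_LAUNCHED,
        STOP_CONTAINER,
        CONTAINER_EXITED_WITH_SUCCESS,
        CONTAINER_KILLED_ON_REQUEST,
        CONTAINER_EXITED_WITH_FAILURE
    }

    // All three termination events count as a successful stop while STOPPING.
    static boolean isTermination(Event event) {
        return event == Event.CONTAINER_EXITED_WITH_SUCCESS
            || event == Event.CONTAINER_KILLED_ON_REQUEST
            || event == Event.CONTAINER_EXITED_WITH_FAILURE;
    }

    // Returns the next state for a container that allows multi-start,
    // or throws if the event is not accepted in the current state.
    static State next(State current, Event event) {
        switch (current) {
            case LOCALIZED:
                if (event == Event.START_CONTAINER) return State.LAUNCHING;
                break;
            case LAUNCHING:
                if (event == Event.CONTAINER_LAUNCHED) return State.RUNNING;
                break;
            case RUNNING:
                if (event == Event.STOP_CONTAINER) return State.STOPPING;
                break;
            case STOPPING:
                if (isTermination(event)) return State.LOCALIZED;
                break;
            default:
                break;
        }
        throw new IllegalStateException("Invalid transition: " + current + " on " + event);
    }
}
```

In this model a start request is accepted only in LOCALIZED, so a second START_CONTAINER while the container is LAUNCHING or RUNNING is rejected, which is the distinction the LAUNCHING state is meant to provide.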

h4. Working directory cleanup
When a container is in the LOCALIZED state and multi-start is enabled, the application master can issue the following 3 new types of requests:
# StartContainers (ContainerLaunchContext == NULL)
# InitializeContainers
# StartContainers (ContainerLaunchContext != NULL)

In case 1) the container is simply started using the ContainerLaunchContext issued in the previous InitializeContainers request (the state machine transitions for this case are the ones described in the previous section). Cases 2) and 3) both perform reinitialization and relocalization of container resources; the only difference is that in 3) the container is also started after relocalization. Currently, when the container is reinitialized, the container working directory is deleted to ensure a clean state for subsequent container starts. We could relax this behavior and allow the application master to specify a deletion policy for container reinitialization. Depending on the requirements, we might want to address this aspect here or in a follow-up JIRA.
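
The three cases can be summarized with the following sketch. The class, enum and method names are hypothetical and only mirror the description above; they are not taken from the patch.

```java
// Hypothetical summary of how a request is handled for a multi-start
// container in the LOCALIZED state. Names are illustrative only.
final class LocalizedRequestSketch {
    enum Request { START_CONTAINERS, INITIALIZE_CONTAINERS }

    enum Action { START_WITH_PREVIOUS_CONTEXT, REINITIALIZE, REINITIALIZE_AND_START }

    // hasLaunchContext is true when the request carries a non-null ContainerLaunchContext.
    static Action resolve(Request request, boolean hasLaunchContext) {
        if (request == Request.INITIALIZE_CONTAINERS) {
            return Action.REINITIALIZE;               // case 2
        }
        return hasLaunchContext
            ? Action.REINITIALIZE_AND_START           // case 3
            : Action.START_WITH_PREVIOUS_CONTEXT;     // case 1
    }

    // Current behavior as described: every reinitialization deletes the working directory.
    static boolean deletesWorkingDirectory(Action action) {
        return action != Action.START_WITH_PREVIOUS_CONTEXT;
    }
}
```

A deletion policy supplied by the application master would replace the fixed rule in deletesWorkingDirectory.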

h4. Log handling
Currently, there is no special handling of logs for a restarted container. The application master can decide either to append the new logs to the old ones or to overwrite the old logs. This can be achieved simply by changing the launch command (e.g. on Linux, use “>>” to append and “>” to overwrite).
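
As an illustration, an application master could build the launch command along these lines. The helper below is hypothetical; the "<LOG_DIR>" placeholder in the usage note is the log directory expansion variable that YARN substitutes at launch time.

```java
// Hypothetical helper: appends a stdout redirection to a launch command.
// append == true keeps the logs of previous runs, false overwrites them.
final class LogRedirectSketch {
    static String withStdoutRedirect(String command, String logFile, boolean append) {
        String operator = append ? ">>" : ">";
        return command + " " + operator + " " + logFile;
    }
}
```

For example, withStdoutRedirect("java -jar app.jar", "<LOG_DIR>/stdout", true) yields a command that appends to the existing stdout file on each restart.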

h4. Token expiration
Both the InitializeContainers and the StartContainers APIs require a container token to authorize the request. For long-running containers, the token might expire, and the application master would then be unable to request a restart or a reinitialization of the container. The same limitation currently applies to the IncreaseContainerResource API. We might need to address container token renewal in a separate JIRA.

h4. Recovery for containers that allow multi-start
The current patch does not fully support recovery of containers that allow multi-start. After a restart of the NodeManager, if the container is not running, the NodeManager cannot distinguish between a stopped container waiting for a start request and a container that completed its execution successfully. Additional information in the state store might be needed to handle this case.
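
One possible shape for that additional information is a persisted "awaiting start" marker. The sketch below only illustrates the idea; the marker, the class and the enum are assumptions and do not correspond to anything in the current state store.

```java
// Hypothetical recovery decision using an extra persisted flag.
// awaitingStartRecorded stands for a marker written to the state store
// when a multi-start container returns to LOCALIZED after a stop.
final class RecoverySketch {
    enum Recovered { RUNNING, AWAITING_START, COMPLETED }

    static Recovered classify(boolean processRunning,
                              boolean allowsMultiStart,
                              boolean awaitingStartRecorded) {
        if (processRunning) {
            return Recovered.RUNNING;
        }
        if (allowsMultiStart && awaitingStartRecorded) {
            return Recovered.AWAITING_START;
        }
        // Without the marker, a non-running container is treated as completed.
        return Recovered.COMPLETED;
    }
}
```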

h4. Auxiliary Service Data
In the current YARN implementation, a CONTAINER_INIT and an APPLICATION_INIT event are sent to the auxiliary services every time a new container is initialized. With the new initializeContainers API, it is possible to reinitialize a container multiple times without ever starting it. The patch as currently implemented sends a CONTAINER_INIT and an APPLICATION_INIT event for every reinitialization of a container (potentially sending new data to the auxiliary services). We should verify whether this behavior is correct or needs to be modified.

h4. Container failures handling
In the current patch implementation, if a container fails during a reinitialization, the container is destroyed. On the other hand, if the container fails within the STOPPING state, this is considered a successful stop. Should we allow the application master to specify a failure-handling policy for stopping and reinitializing?

h4. 'Container Destroy' monitor
The proposed patch allows the application master to specify a destroyDelay: an idle container that is not started within this delay is destroyed automatically. The destroy logic is not yet implemented in the current patch. We might need to implement a “destroy containers monitor” service that checks, at a configurable time interval, for containers to destroy.
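
A minimal sketch of the check such a monitor could run on each tick is shown below. Everything here is hypothetical (class names, the idle-container bookkeeping, string container IDs); a real service would also need to schedule the check and trigger the actual destroy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical core of a "destroy containers monitor": given the idle
// containers and the current time, find those whose destroyDelay has elapsed.
final class DestroyMonitorSketch {
    static final class IdleContainer {
        final long idleSinceMillis;     // when the container became idle (LOCALIZED)
        final long destroyDelayMillis;  // destroyDelay requested by the application master

        IdleContainer(long idleSinceMillis, long destroyDelayMillis) {
            this.idleSinceMillis = idleSinceMillis;
            this.destroyDelayMillis = destroyDelayMillis;
        }
    }

    // Returns the IDs of containers to destroy, sorted by ID for determinism.
    static List<String> findExpired(Map<String, IdleContainer> idle, long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, IdleContainer> entry : new TreeMap<>(idle).entrySet()) {
            IdleContainer container = entry.getValue();
            if (nowMillis - container.idleSinceMillis >= container.destroyDelayMillis) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

Passing the current time as a parameter keeps the check easy to test; the monitor service would call it periodically and remove a container from the idle map whenever it is started or destroyed.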

h4. Uploaded resource
During container relocalization, do we need specific logic for resources that are uploaded to the shared cache? Currently, the old container local resources are released before the new resources are localized. Do we also have to clear the resourcesUploadPolicies map of ContainerImpl during relocalization?




> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Marco Rabozzi
>         Attachments: YARN-4876-design-doc.pdf, YARN-4876.002.patch, YARN-4876.01.patch
>
>
> Introduce *initialize* and *destroy* container API into the *ContainerManagementProtocol* and decouple the actual start of a container from the initialization. This will allow AMs to re-start a container without having to lose the allocation.
> Additionally, if the localization of the container is associated with the initialize (and the cleanup with the destroy), this can also be used by applications to upgrade a container by *re-initializing* it with a new *ContainerLaunchContext*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
