Posted to yarn-issues@hadoop.apache.org by "Billie Rinaldi (JIRA)" <ji...@apache.org> on 2018/05/09 19:43:00 UTC

[jira] [Comment Edited] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers

    [ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469368#comment-16469368 ] 

Billie Rinaldi edited comment on YARN-8243 at 5/9/18 7:42 PM:
--------------------------------------------------------------

I strongly feel we should not make the changes to FlexComponentTransition suggested in this patch, because they will create holes in the sequence of component instance IDs. We should only make the compareTo fix, which will still allow the new unit test to pass.

Sorry, I didn't answer your actual question. With the changes to FlexComponentTransition, a hole can be created when comp1 is the only pending container. In that case comp1 would be removed even though it is not the instance with the highest ID.
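
To make the concern concrete, here is a minimal sketch of the two removal orders. It is not the actual FlexComponentTransition or compareTo code; the class FlexDownSketch, the Instance type, and both method names are hypothetical, and the starting state (instances 0 and 2 running, instance 1 pending) is assumed purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the real yarn-native-services classes.
final class FlexDownSketch {

  /** Stand-in for a component instance: its ID and whether its container is still pending. */
  static final class Instance {
    final int id;
    final boolean pending;

    Instance(int id, boolean pending) {
      this.id = id;
      this.pending = pending;
    }
  }

  /** Removes the instances with the highest IDs, ignoring pending state. */
  static List<Integer> removeHighestIds(List<Instance> instances, int count) {
    List<Instance> removalOrder = new ArrayList<>(instances);
    removalOrder.sort((a, b) -> Integer.compare(b.id, a.id));
    return survivors(removalOrder, count);
  }

  /** Removes pending instances first, then falls back to the highest IDs. */
  static List<Integer> removePendingFirst(List<Instance> instances, int count) {
    List<Instance> removalOrder = new ArrayList<>(instances);
    removalOrder.sort((a, b) -> {
      if (a.pending != b.pending) {
        return a.pending ? -1 : 1; // pending instances sort to the front
      }
      return Integer.compare(b.id, a.id);
    });
    return survivors(removalOrder, count);
  }

  /** Drops the first 'count' entries of the removal order and returns the remaining IDs, ascending. */
  private static List<Integer> survivors(List<Instance> removalOrder, int count) {
    int from = Math.min(count, removalOrder.size());
    List<Integer> ids = new ArrayList<>();
    for (Instance i : removalOrder.subList(from, removalOrder.size())) {
      ids.add(i.id);
    }
    Collections.sort(ids);
    return ids;
  }

  public static void main(String[] args) {
    // Assumed state: comp0 and comp2 are running, comp1 is the only pending instance.
    List<Instance> instances = Arrays.asList(
        new Instance(0, false), new Instance(1, true), new Instance(2, false));

    System.out.println(removeHighestIds(instances, 1));   // prints [0, 1]
    System.out.println(removePendingFirst(instances, 1)); // prints [0, 2]
  }
}
```

With pending-first removal the surviving IDs are 0 and 2, so ID 1 becomes a hole. Removing strictly by highest ID leaves 0 and 1, which stays contiguous.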


> Flex down should first remove pending container requests (if any) and then kill running containers
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8243
>                 URL: https://issues.apache.org/jira/browse/YARN-8243
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn-native-services
>    Affects Versions: 3.1.0
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>            Priority: Major
>         Attachments: YARN-8243.01.patch
>
>
> This is easy to test on a service with an anti-affinity component, which makes it simple to simulate pending container requests. They can also be simulated by other means (no resources left in the cluster, etc.).
> Service yarnfile used to test this -
> {code:java}
> {
>   "name": "sleeper-service",
>   "version": "1",
>   "components" :
>   [
>     {
>       "name": "ping",
>       "number_of_containers": 2,
>       "resource": {
>         "cpus": 1,
>         "memory": "256"
>       },
>       "launch_command": "sleep 9000",
>       "placement_policy": {
>         "constraints": [
>           {
>             "type": "ANTI_AFFINITY",
>             "scope": "NODE",
>             "target_tags": [
>               "ping"
>             ]
>           }
>         ]
>       }
>     }
>   ]
> }
> {code}
> Launch a service with the above yarnfile as below -
> {code:java}
> yarn app -launch simple-aa-1 simple_AA.json
> {code}
> Let's assume there are only 5 nodes in this cluster. Now, flex the above service to one container more than the number of nodes (6 in my case).
> {code:java}
> yarn app -flex simple-aa-1 -component ping 6
> {code}
> Only 5 containers will be allocated and running for simple-aa-1. At this point, flex it down to 5 containers -
> {code:java}
> yarn app -flex simple-aa-1 -component ping 5
> {code}
> This is what is seen in the serviceam log at this point -
> {noformat}
> 2018-05-03 20:17:38,469 [IPC Server handler 0 on 38124] INFO  service.ClientAMService - Flexing component ping to 5
> 2018-05-03 20:17:38,469 [Component  dispatcher] INFO  component.Component - [FLEX DOWN COMPONENT ping]: scaling down from 6 to 5
> 2018-05-03 20:17:38,470 [Component  dispatcher] INFO  instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Flexed down by user, destroying.
> 2018-05-03 20:17:38,473 [Component  dispatcher] INFO  component.Component - [COMPONENT ping] Transitioned from FLEXING to STABLE on FLEX event.
> 2018-05-03 20:17:38,474 [pool-5-thread-8] INFO  registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleting registry path /users/root/services/yarn-service/simple-aa-1/components/ctr-1525297086734-0013-01-000006
> 2018-05-03 20:17:38,476 [Component  dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> 	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> 	at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,480 [Component  dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> 	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> 	at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,578 [pool-5-thread-8] INFO  instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleted component instance dir: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-4
> 2018-05-03 20:17:39,268 [AMRM Callback Handler Thread] WARN  service.ServiceScheduler - Container container_1525297086734_0013_01_000006 Completed. No component instance exists. exitStatus=-100. diagnostics=Container released by application 
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO  service.ServiceScheduler - 1 containers allocated. 
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO  service.ServiceScheduler - [COMPONENT ping]: remove 0 outstanding container requests for allocateId 0
> 2018-05-03 20:17:40,274 [Component  dispatcher] INFO  component.Component - [COMPONENT ping]: container_1525297086734_0013_01_000007 allocated, num pending component instances reduced to 0
> 2018-05-03 20:17:40,274 [Component  dispatcher] INFO  component.Component - [COMPONENT ping]: Assigned container_1525297086734_0013_01_000007 to component instance ping-5 and launch on host ctr-e138-1518143905142-280820-01-000008.example.site:25454 
> 2018-05-03 20:17:40,277 [pool-6-thread-6] INFO  provider.ProviderUtils - [COMPINSTANCE ping-5 : container_1525297086734_0013_01_000007]: Creating dir on hdfs: hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-5
> 2018-05-03 20:17:40,316 [pool-6-thread-6] INFO  containerlaunch.ContainerLaunchService - launching container container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,318 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #5] INFO  impl.NMClientAsyncImpl - Processing Event EventType: START_CONTAINER for Container container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,338 [Component  dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CONTAINER_STARTED at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_STARTED at STABLE
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> 	at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
> 	at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Status response shows that only 4 containers are running and the service is not in STABLE state -
> {code:java}
> yarn app -status simple-aa-1
> {code}
> output -
> {code:java}
> {
>     "components": [
>         {
>             "configuration": {
>                 "env": {},
>                 "files": [],
>                 "properties": {}
>             },
>             "containers": [
>                 {
>                     "bare_host": "ctr-e138-1518143905142-280820-01-000007.example.site",
>                     "component_instance_name": "ping-1",
>                     "hostname": "ctr-e138-1518143905142-280820-01-000007.example.site",
>                     "id": "container_1525297086734_0013_01_000003",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378141535,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": "ctr-e138-1518143905142-280820-01-000006.example.site",
>                     "component_instance_name": "ping-0",
>                     "hostname": "ctr-e138-1518143905142-280820-01-000006.example.site",
>                     "id": "container_1525297086734_0013_01_000002",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378141513,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": "ctr-e138-1518143905142-280820-01-000005.example.site",
>                     "component_instance_name": "ping-3",
>                     "hostname": "ctr-e138-1518143905142-280820-01-000005.example.site",
>                     "id": "container_1525297086734_0013_01_000005",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378303429,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": "ctr-e138-1518143905142-280820-01-000004.example.site",
>                     "component_instance_name": "ping-2",
>                     "hostname": "ctr-e138-1518143905142-280820-01-000004.example.site",
>                     "id": "container_1525297086734_0013_01_000004",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378303425,
>                     "state": "READY"
>                 }
>             ],
>             "dependencies": [],
>             "launch_command": "sleep 9000",
>             "name": "ping",
>             "number_of_containers": 5,
>             "placement_policy": {
>                 "constraints": [
>                     {
>                         "node_attributes": {},
>                         "node_partitions": [],
>                         "scope": "NODE",
>                         "target_tags": [
>                             "ping"
>                         ],
>                         "type": "ANTI_AFFINITY"
>                     }
>                 ]
>             },
>             "quicklinks": [],
>             "resource": {
>                 "additional": {},
>                 "cpus": 1,
>                 "memory": "256"
>             },
>             "run_privileged_container": false,
>             "state": "FLEXING"
>         }
>     ],
>     "configuration": {
>         "env": {},
>         "files": [],
>         "properties": {}
>     },
>     "id": "application_1525297086734_0013",
>     "kerberos_principal": {},
>     "lifetime": -1,
>     "name": "simple-aa-1",
>     "quicklinks": {},
>     "state": "STARTED",
>     "version": "1"
> }
> {code}


