Posted to yarn-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org> on 2016/01/13 03:36:39 UTC
[jira] [Updated] (YARN-4502) Sometimes Two AM containers get launched
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kumar Vavilapalli updated YARN-4502:
------------------------------------------
Attachment: YARN-4502-20160212.txt
Here's a patch fixing this.
The core fix is relatively concentrated. We now perform kill-container and recover-requests in one shot, without going through all the events (a path that was broken anyway).
Most of the remaining changes are test cases, plus a few renames and refactors.
Summary of the changes:
- AbstractYarnScheduler: Renamed the existing {{completedContainer()}} method to {{completedContainerInternal()}} and added a new {{completedContainer()}} that wraps {{completedContainerInternal()}}, does the common null-checks and also recovers ResourceRequests as needed
- SchedulerEventType.KILL_CONTAINER -> SchedulerEventType.KILL_PREEMPTED_CONTAINER and DROP_RESERVATION -> KILL_RESERVED_CONTAINER
- Got rid of ContainerRescheduledTransition completely; there is no need for the container to send an event to the scheduler. With that transition removed, got rid of ContainerRescheduledEvent too.
- Moved ContainerPreemptEvent from org.apache.hadoop.yarn.server.resourcemanager.scheduler package into org.apache.hadoop.yarn.server.resourcemanager.scheduler.event. That is where it belongs.
- PreemptableResourceScheduler: Renamed APIs dropContainerReservation -> killReservedContainer and killContainer -> killPreemptedContainer
- FiCaSchedulerApp.addPreemptContainer -> preemptContainer
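
To illustrate the shape of the AbstractYarnScheduler change, here is a minimal, self-contained sketch of the wrapper pattern. It is not the patch itself: the class CompletedContainerSketch, the nested Container type, the boolean recoverRequests flag and the string-based requests are all hypothetical simplifications, and the real condition for recovering ResourceRequests is decided in the patch, not here. Only the method names completedContainer(), completedContainerInternal() and killPreemptedContainer() are taken from the summary above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for a scheduler; NOT the real AbstractYarnScheduler. */
class CompletedContainerSketch {

    /** Hypothetical container record that remembers the requests it was allocated for. */
    static final class Container {
        final String id;
        final List<String> originalRequests;

        Container(String id, List<String> originalRequests) {
            this.id = id;
            this.originalRequests = originalRequests;
        }
    }

    final Map<String, Container> liveContainers = new HashMap<>();
    final List<String> pendingRequests = new ArrayList<>();

    /**
     * Wrapper: does the common null-check, delegates the bookkeeping, and
     * recovers the original requests in the same call (no separate event).
     * The recoverRequests flag is an assumption made for this sketch.
     */
    void completedContainer(Container container, boolean recoverRequests) {
        if (container == null) {
            return; // unknown or already completed; nothing to do
        }
        completedContainerInternal(container);
        if (recoverRequests) {
            pendingRequests.addAll(container.originalRequests);
        }
    }

    /** Scheduler-specific bookkeeping; here it only forgets the container. */
    void completedContainerInternal(Container container) {
        liveContainers.remove(container.id);
    }

    /** Kill and recover in one shot by going through the wrapper. */
    void killPreemptedContainer(String containerId) {
        completedContainer(liveContainers.get(containerId), true);
    }

    public static void main(String[] args) {
        CompletedContainerSketch scheduler = new CompletedContainerSketch();
        scheduler.liveContainers.put(
            "c1", new Container("c1", Arrays.asList("am-request")));

        scheduler.killPreemptedContainer("c1");
        // A second kill finds no live container, so nothing is recovered twice.
        scheduler.killPreemptedContainer("c1");

        System.out.println(scheduler.pendingRequests);
    }
}
```

In this sketch the second kill is a no-op because the null-check sits in the wrapper, so the original request is put back exactly once.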
> Sometimes Two AM containers get launched
> ----------------------------------------
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yesha Vora
> Assignee: Vinod Kumar Vavilapalli
> Priority: Critical
> Attachments: YARN-4502-20160212.txt
>
>
> Scenario :
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
> yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar hadoop-yarn-applications-distributedshell-*.jar -attempt_failures_validity_interval 60000 -shell_command "sleep 150" -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_000002
> INFO impl.TimelineClientImpl: Timeline service address: http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:<port>
> Total number of containers :2
> Container-Id Start Time Finish Time State Host Node Http Address LOG-URL
> container_e12_1450825622869_0001_02_000002 Tue Dec 22 23:07:35 +0000 2015 N/A RUNNING xxx:25454 http://xxx:8042 http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000002/hrt_qa
> container_e12_1450825622869_0001_02_000001 Tue Dec 22 23:07:34 +0000 2015 N/A RUNNING xxx:25454 http://xxx:8042 http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000001/hrt_qa
> {code}
> * look for new AM pid
> Here, the 2nd AM container was supposed to be started on container_e12_1450825622869_0001_02_000001, but the AM was not launched on that container. It was in ACQUIRED state.
> On the other hand, container_e12_1450825622869_0001_02_000002 got the AM running.
> Expected behavior: RM should not start 2 containers for starting AM
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)