Posted to user@hadoop.apache.org by manoj <ma...@gmail.com> on 2015/08/15 01:53:35 UTC

Map tasks keep running even after the node is killed on Apache YARN.

Hi,

I'm on Apache Hadoop 2.6.0 YARN and I'm trying to test dynamic addition
and removal of nodes from the cluster.

The test starts a job on 2 nodes and, while the job is in progress,
removes one of the nodes* by killing its DataNode and NodeManager
daemons. (Is it OK to remove a node like this?)

*This node is definitely not running the ResourceManager or the
ApplicationMaster.

After the node is successfully removed (I can confirm this from the
ResourceManager logs, attached below), the test adds it back and waits
until the job completes.
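
For reference, "removing" and "adding back" the node boils down to
something like this on the target host (a sketch assuming a stock
Hadoop 2.6 layout under $HADOOP_HOME; the test itself just kills and
later relaunches the daemon processes):

    # remove the node: stop its worker daemons
    $HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
    $HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager

    # add it back: start them again
    $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
    $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager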

Node Removal Logs:

2015-08-14 11:15:56,902 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:host172:36158 Timed out after 60 secs
2015-08-14 11:15:56,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node host172:36158 as it is now LOST
2015-08-14 11:15:56,904 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:36158 Node Transitioned from RUNNING to LOST
2015-08-14 11:15:56,905 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000006 Container Transitioned from RUNNING to KILLED
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000006 in state: KILLED event:KILL
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp   RESULT=SUCCESS  APPID=application_1439575616861_0001   CONTAINERID=container_1439575616861_0001_01_000006
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000006 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:1024, vCores:7> available, release resources=true
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:3584, vCores:3> numContainers=3 user=hadoop user-resources=<memory:3584, vCores:3>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000006, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3 cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.75 absoluteUsedCapacity=1.75 used=<memory:3584, vCores:3> cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000006 on node: host: host172:36158 #containers=1 available=1024 used=1024 with event: KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000005 Container Transitioned from RUNNING to KILLED
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000005 in state: KILLED event:KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp   RESULT=SUCCESS  APPID=application_1439575616861_0001   CONTAINERID=container_1439575616861_0001_01_000005
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000005 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 0 containers, <memory:0, vCores:0> used and <memory:2048, vCores:8> available, release resources=true
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:2560, vCores:2> numContainers=2 user=hadoop user-resources=<memory:2560, vCores:2>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000005, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2 cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.25 absoluteUsedCapacity=1.25 used=<memory:2560, vCores:2> cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000005 on node: host: host172:36158 #containers=0 available=2048 used=0 with event: KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node host172:36158 clusterResource: <memory:2048, vCores:8>

Node Addition logs:

2015-08-14 11:19:43,529 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved host172 to /default-rack
2015-08-14 11:19:43,530 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node host172(cmPort: 59426 httpPort: 8042) registered with capability: <memory:2048, vCores:8>, assigned nodeId host172:59426
2015-08-14 11:19:43,533 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:59426 Node Transitioned from NEW to RUNNING
2015-08-14 11:19:43,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Added node host172:59426 clusterResource: <memory:4096, vCores:16>

*Here's the problem:*

The job never completes! According to the logs, the map tasks that were
scheduled on the removed node are still "RUNNING" with a map progress of
100%, and they stay in that state forever.

In the ApplicationMaster container logs I can see that it keeps trying
to connect to the node's old address, host172/XX.XX.XX.XX:36158, even
though the node was removed and re-registered on a different port,
host172/XX.XX.XX.XX:59426:

......
......
2015-08-14 11:25:21,662 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: host172/XX.XX.XX.XX:36158. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
......
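
(My guess as to why the port changes: yarn.nodemanager.address defaults
to an ephemeral port, so every NodeManager restart registers with a new
NodeId. Pinning it in yarn-site.xml, e.g. as below, should at least keep
the address stable across restarts; the port value is just illustrative
and I haven't verified this helps:)

    <property>
      <name>yarn.nodemanager.address</name>
      <!-- any fixed, free port; the default uses port 0 (ephemeral) -->
      <value>0.0.0.0:45454</value>
    </property>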

Please let me know if you need to see any more logs.

P.S.: The job completes normally, without dynamic addition and removal
of nodes, on the same cluster with the same memory settings.
Thanks,
--Manoj Kumar M

Re: Undeliverable: Map tasks keep running even after the node is killed on Apache YARN.

Posted by manoj <ma...@gmail.com>.
It looks like the ApplicationMaster keeps trying to connect to the node
that was removed for ~30 min before it gives up.
Can I reduce this wait time and the number of retries?
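
For example, are these the right knobs? (My reading of yarn-default.xml
and core-default.xml in 2.6.0; the values below are illustrative and
unverified:)

    <!-- yarn-site.xml: outer retry loop of the AM's NodeManager client -->
    <property>
      <name>yarn.client.nodemanager-connect.max-wait-ms</name>
      <value>60000</value>   <!-- default 900000 (15 min) -->
    </property>
    <property>
      <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
      <value>10000</value>   <!-- default 10000 -->
    </property>

    <!-- core-site.xml: inner IPC retry seen in the log line
         "RetryUpToMaximumCountWithFixedSleep(maxRetries=10, ...)" -->
    <property>
      <name>ipc.client.connect.max.retries</name>
      <value>3</value>       <!-- default 10 -->
    </property>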


Thanks
-Manoj

On Fri, Aug 14, 2015 at 4:54 PM, <po...@scai.it> wrote:

> *Delivery to the following recipients or groups failed:*
>
> bigdatagroup@itecons.it
> The recipient's mailbox is full and cannot accept messages at this
> time. Please try sending the message again later.
>
> Your message was rejected by the following organization:
> MAIL2.scai.intra.


-- 
--Manoj Kumar M

Re: Non recapitabile: Map tasks keep Running even after the node is killed on Apache Yarn.

Posted by manoj <ma...@gmail.com>.
Looks like the App master tries to connect to the Node that was removed for
~30Min before it gives up.
Can I reduce this wait time and number of tries?


Thanks
-Manoj

On Fri, Aug 14, 2015 at 4:54 PM, <po...@scai.it> wrote:

> *Il recapito non è riuscito per i seguenti destinatari o gruppi:*
>
> bigdatagroup@itecons.it
> La cassetta postale del destinatario è piena e non può accettare messaggi
> in questo momento. Riprova a inviare il messaggio più tardi.
>
> Il tuo messaggio è stato rifiutato dalla seguente organizzazione:
> MAIL2.scai.intra.
>
>
>
>
>
>
> *Informazioni di diagnostica per gli amministratori:*
>
> Server di generazione: mail2.scai.intra
>
> bigdatagroup@itecons.it
> MAIL2.scai.intra
> Remote Server returned '554 5.2.2 mailbox full;
> STOREDRV.Deliver.Exception:QuotaExceededException.MapiExceptionShutoffQuotaExceeded;
> Failed to process message due to a permanent exception with message Non
> riesco ad aprire la cassetta postale /o=Scai SpA/ou=Exchange Administrative
> Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=MAIL2/cn=Microsoft
> System Attendant. 16.55847:7E000000,
> 17.43559:0000000048010000000000002E00000000000000,
> 20.52176:000FE3832000000000000000, 20.50032:000FE3839017000000000000,
> 255.23226:000FE383, 255.27962:FE000000, 255.17082:DD040000,
> 0.26937:00000000, 4.21921:DD040000, 255.27962:FA000000, 255.1494:00000000,
> 0.50608:00000000,
> 5.29818:0000000034666631356633392D393430392D346230382D386264342D37666561303430643232623900000000,
> 1.29920:03000000, 7.29828:0A87B90E0000000000000000,
> 7.29832:0054B80E0000000000000000, 4.45884:DD040000, 4.29880:DD040000,
> 4.29888:DD040000, 1.56872:FE000000, 4.42712:DD040000,
> 5.10786:0000000031352E30302E303939352E3033323A6D61696C3200000000,
> 255.1750:000FE383, 0.26849:00000000, 255.21817:DD040000, 0.26297:0A000000,
> 4.16585:DD040000, 0.32441:00000000, 4.1706:DD040000, 0.24761:0A000000,
> 4.20665:DD040000, 0.25785:00000000, 4.29881:DD040000 [Stage: CreateSession]'
>
> Intestazioni originali del messaggio:
>
> Received: from MAIL2.scai.intra (10.110.4.14) by mail2.scai.intra
>  (10.110.4.14) with Microsoft SMTP Server (TLS) id 15.0.995.29; Sat, 15 Aug
>  2015 01:54:02 +0200
> Received: from mail.grupposcai.it (10.110.4.1) by MAIL2.scai.intra
>  (10.110.4.14) with Microsoft SMTP Server id 15.0.995.29 via Frontend
>  Transport; Sat, 15 Aug 2015 01:54:02 +0200
> X-BYPSHEADER: 12196482
> X-SMScore: -100000
> X-LCID: 9520602
> Received: from [(10.110.4.1)] by GTW-DMZ with Xeams SMTP; Sat, 15 Aug 2015 01:53:57 +0200 (CEST)
> X-SM_EnvelopeFrom: manojm.321@gmail.com
> X-SM_RECEIVED_ON: Sat, 15 Aug 2015 01:53:57 +0200 (CEST)
> X-SMScore: -970
> X-LCID: 9520600
> Received: from [(140.211.11.3)] by GTW-DMZ with Xeams SMTP; Sat, 15 Aug 2015 01:53:45 +0200 (CEST)
> X-SM_EnvelopeFrom: user-return-20396-bigdatagroup=itecons.it@hadoop.apache.org
> X-SM_RECEIVED_ON: Sat, 15 Aug 2015 01:53:45 +0200 (CEST)
> Received: (qmail 45474 invoked by uid 500); 14 Aug 2015 23:53:40 -0000
> Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
> Precedence: bulk
> List-Help: <ma...@hadoop.apache.org>
> List-Unsubscribe: <ma...@hadoop.apache.org>
> List-Post: <ma...@hadoop.apache.org>
> List-Id: <user.hadoop.apache.org>
> Reply-To: <us...@hadoop.apache.org>
> Delivered-To: mailing list user@hadoop.apache.org
> Received: (qmail 45460 invoked by uid 99); 14 Aug 2015 23:53:40 -0000
> Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142)
>     by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Aug 2015 23:53:40 +0000
> Received: from localhost (localhost [127.0.0.1])
> 	by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id B1120DDE37
> 	for <us...@hadoop.apache.org>; Fri, 14 Aug 2015 23:53:39 +0000 (UTC)
> X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org
> X-Spam-Flag: NO
> X-Spam-Score: 3.129
> X-Spam-Level: ***
> X-Spam-Status: No, score=3.129 tagged_above=-999 required=6.31
> 	tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
> 	FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=3,
> 	RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001]
> 	autolearn=disabled
> Authentication-Results: spamd1-us-west.apache.org (amavisd-new);
> 	dkim=pass (2048-bit key) header.d=gmail.com
> Received: from mx1-eu-west.apache.org ([10.40.0.8])
> 	by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024)
> 	with ESMTP id 7p9eDKlZHd2d for <us...@hadoop.apache.org>;
> 	Fri, 14 Aug 2015 23:53:37 +0000 (UTC)
> Received: from mail-oi0-f50.google.com (mail-oi0-f50.google.com [209.85.218.50])
> 	by mx1-eu-west.apache.org (ASF Mail Server at mx1-eu-west.apache.org) with ESMTPS id D9CA42136C
> 	for <us...@hadoop.apache.org>; Fri, 14 Aug 2015 23:53:36 +0000 (UTC)
> Received: by oip136 with SMTP id 136so52683028oip.1
>         for <us...@hadoop.apache.org>; Fri, 14 Aug 2015 16:53:35 -0700 (PDT)
> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
>         d=gmail.com; s=20120113;
>         h=mime-version:date:message-id:subject:from:to:content-type;
>         bh=T1wf4bljaGb/zvLHrMaN7Q+F20Hqif18v22xBCeBLns=;
>         b=qy92Y3pzKefMnhUQIZnh4hS1+n8pN7c0RomzeWyzQZnTDroUk76CvZyxBt0nb+9YNb
>          oPMHKbQLWUvU+qE5N6tBJXu8uPEE0Rzju7n0XJ1AhgAO409atHt5lJsh9X0yz1CU3szK
>          oP70vwr33UlObl10O4lqBnFrVFAX9cK44zh3jYKxvO1gRxk4g5XnW2swmeDrldYf0eR4
>          eibUuU9H1j03RiTggrhVFOhuqs4zVxEIcn7KYDIXxtlkaq3RZMlRIAtg7e/aRttQcwbg
>          f3CuCa/zKtJTEKHCCI+3HQkneVeMHVcwe86UTl/jTDZ5sL0m7rJWSZdyLumhqDDaHmJ0
>          AhKA==
> MIME-Version: 1.0
> X-Received: by 10.202.48.200 with SMTP id w191mr13116197oiw.13.1439596415743;
>  Fri, 14 Aug 2015 16:53:35 -0700 (PDT)
> Received: by 10.182.22.170 with HTTP; Fri, 14 Aug 2015 16:53:35 -0700 (PDT)
> Date: Fri, 14 Aug 2015 16:53:35 -0700
> Message-ID: <CA...@mail.gmail.com>
> Subject: Map tasks keep Running even after the node is killed on Apache Yarn.
> From: manoj <ma...@gmail.com>
> To: <us...@hadoop.apache.org>
> Content-Type: multipart/alternative; boundary="001a113cd878fafe24051d4e28bf"
> Return-Path: manojm.321@gmail.com
>
>
> Final-Recipient: rfc822;bigdatagroup@itecons.it
> Action: failed
> Status: 5.2.2
> Diagnostic-Code: smtp;554 5.2.2 mailbox full;
> STOREDRV.Deliver.Exception:QuotaExceededException.MapiExceptionShutoffQuotaExceeded;
> Failed to process message due to a permanent exception with message Non
> riesco ad aprire la cassetta postale /o=Scai SpA/ou=Exchange Administrative
> Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=MAIL2/cn=Microsoft
> System Attendant. 16.55847:7E000000,
> 17.43559:0000000048010000000000002E00000000000000,
> 20.52176:000FE3832000000000000000, 20.50032:000FE3839017000000000000,
> 255.23226:000FE383, 255.27962:FE000000, 255.17082:DD040000,
> 0.26937:00000000, 4.21921:DD040000, 255.27962:FA000000, 255.1494:00000000,
> 0.50608:00000000,
> 5.29818:0000000034666631356633392D393430392D346230382D386264342D37666561303430643232623900000000,
> 1.29920:03000000, 7.29828:0A87B90E0000000000000000,
> 7.29832:0054B80E0000000000000000, 4.45884:DD040000, 4.29880:DD040000,
> 4.29888:DD040000, 1.56872:FE000000, 4.42712:DD040000,
> 5.10786:0000000031352E30302E303939352E3033323A6D61696C320000
>  0000, 255.1750:000FE383, 0.26849:00000000, 255.21817:DD040000,
> 0.26297:0A000000, 4.16585:DD040000, 0.32441:00000000, 4.1706:DD040000,
> 0.24761:0A000000, 4.20665:DD040000, 0.25785:00000000, 4.29881:DD040000
> [Stage: CreateSession]
> Remote-MTA: dns;MAIL2.scai.intra
>
>
>
> ---------- Forwarded message ----------
> From: manoj <ma...@gmail.com>
> To: <us...@hadoop.apache.org>
> Cc:
> Date: Fri, 14 Aug 2015 16:53:35 -0700
> Subject: Map tasks keep Running even after the node is killed on Apache
> Yarn.
> Hi,
>
> I'm on Apache2.6.0 YARN and I'm trying to test the dynamic addition and
> removal of nodes from the Cluster.
>
> The test starts a Job with 2 nodes and while the Job is progressing, It
> removes one of the node* by killing the dataNode and NodeManager Daemons.(
> is it ok to remove a node like this? )
>
> *this node is not running ResourceManager/ApplicationMaster for sure.
>
> After the node is successfully removed( I can confirm this from resource
> manager logs- attached) the test adds it back and waits till the job
> completes.
>
> Node Removal Logs:
>
> 2015-08-14 11:15:56,902 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:host172:36158 Timed out after 60 secs
> 2015-08-14 11:15:56,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node host172:36158 as it is now LOST
> 2015-08-14 11:15:56,904 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:36158 Node Transitioned from RUNNING to LOST
> 2015-08-14 11:15:56,905 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000006 Container Transitioned from RUNNING to KILLED
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000006 in state: KILLED event:KILL
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1439575616861_0001    CONTAINERID=container_1439575616861_0001_01_000006
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000006 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:1024, vCores:7> available, release resources=true
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:3584, vCores:3> numContainers=3 user=hadoop user-resources=<memory:3584, vCores:3>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000006, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3 cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.75 absoluteUsedCapacity=1.75 used=<memory:3584, vCores:3> cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000006 on node: host: host172:36158 #containers=1 available=1024 used=1024 with event: KILL
> 2015-08-14 11:15:56,907 INFO   org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000005 Container Transitioned from RUNNING to KILLED
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000005 in state: KILLED event:KILL
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1439575616861_0001    CONTAINERID=container_1439575616861_0001_01_000005
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000005 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 0 containers, <memory:0, vCores:0> used and <memory:2048, vCores:8> available, release resources=true
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:2560, vCores:2> numContainers=2 user=hadoop user-resources=<memory:2560, vCores:2>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000005, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2 cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.25 absoluteUsedCapacity=1.25 used=<memory:2560, vCores:2> cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000005 on node: host: host172:36158 #containers=0 available=2048 used=0 with event: KILL
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node host172:36158 clusterResource: <memory:2048, vCores:8>
>
> Node Addition logs:
>
> 2015-08-14 11:19:43,529 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved host172 to /default-rack
> 2015-08-14 11:19:43,530 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node host172(cmPort: 59426 httpPort: 8042) registered with capability: <memory:2048, vCores:8>, assigned nodeId host172:59426
> 2015-08-14 11:19:43,533 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:59426 Node Transitioned from NEW to RUNNING
> 2015-08-14 11:19:43,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Added node host172:59426 clusterResource: <memory:4096, vCores:16>
>
> *Here's the problem:*
>
> The Job never completes! According to the logs the mapTasks which were
> scheduled on the node that was removed are still "RUNNING" with a
> mapProgress of 100%. These tasks stays in the same state forever.
>
> In the AppMasterContainer logs I see that it continuously tries to connect
> to the previous node host172/XX.XX.XX.XX:36158 though it was removed and
> added on a different port host172/XX.XX.XX.XX:59426
>
> ......
> ...
> 2015-08-14 11:25:21,662 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: host172/XX.XX.XX.XX:36158. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> ...
> ...
>
> Please let me know if you need to see any more logs.
>
> P.S: The Jobs completes normally without dynamic addition and removal of
> nodes on the same Cluster with same memory settings.
> Thanks,
> --Manoj Kumar M
>
>


-- 
--Manoj Kumar M

Re: Non recapitabile: Map tasks keep Running even after the node is killed on Apache Yarn.

Posted by manoj <ma...@gmail.com>.
Looks like the App master tries to connect to the Node that was removed for
~30Min before it gives up.
Can I reduce this wait time and number of tries?


Thanks
-Manoj

On Fri, Aug 14, 2015 at 4:54 PM, <po...@scai.it> wrote:

> *Il recapito non è riuscito per i seguenti destinatari o gruppi:*
>
> bigdatagroup@itecons.it
> La cassetta postale del destinatario è piena e non può accettare messaggi
> in questo momento. Riprova a inviare il messaggio più tardi.
>
> Il tuo messaggio è stato rifiutato dalla seguente organizzazione:
> MAIL2.scai.intra.
>
>
>
>
>
>
> *Informazioni di diagnostica per gli amministratori:*
>
> Server di generazione: mail2.scai.intra
>
> bigdatagroup@itecons.it
> MAIL2.scai.intra
> Remote Server returned '554 5.2.2 mailbox full;
> STOREDRV.Deliver.Exception:QuotaExceededException.MapiExceptionShutoffQuotaExceeded;
> Failed to process message due to a permanent exception with message Non
> riesco ad aprire la cassetta postale /o=Scai SpA/ou=Exchange Administrative
> Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=MAIL2/cn=Microsoft
> System Attendant. 16.55847:7E000000,
> 17.43559:0000000048010000000000002E00000000000000,
> 20.52176:000FE3832000000000000000, 20.50032:000FE3839017000000000000,
> 255.23226:000FE383, 255.27962:FE000000, 255.17082:DD040000,
> 0.26937:00000000, 4.21921:DD040000, 255.27962:FA000000, 255.1494:00000000,
> 0.50608:00000000,
> 5.29818:0000000034666631356633392D393430392D346230382D386264342D37666561303430643232623900000000,
> 1.29920:03000000, 7.29828:0A87B90E0000000000000000,
> 7.29832:0054B80E0000000000000000, 4.45884:DD040000, 4.29880:DD040000,
> 4.29888:DD040000, 1.56872:FE000000, 4.42712:DD040000,
> 5.10786:0000000031352E30302E303939352E3033323A6D61696C3200000000,
> 255.1750:000FE383, 0.26849:00000000, 255.21817:DD040000, 0.26297:0A000000,
> 4.16585:DD040000, 0.32441:00000000, 4.1706:DD040000, 0.24761:0A000000,
> 4.20665:DD040000, 0.25785:00000000, 4.29881:DD040000 [Stage: CreateSession]'
>
> Intestazioni originali del messaggio:
>
> Received: from MAIL2.scai.intra (10.110.4.14) by mail2.scai.intra
>  (10.110.4.14) with Microsoft SMTP Server (TLS) id 15.0.995.29; Sat, 15 Aug
>  2015 01:54:02 +0200
> Received: from mail.grupposcai.it (10.110.4.1) by MAIL2.scai.intra
>  (10.110.4.14) with Microsoft SMTP Server id 15.0.995.29 via Frontend
>  Transport; Sat, 15 Aug 2015 01:54:02 +0200
> X-BYPSHEADER: 12196482
> X-SMScore: -100000
> X-LCID: 9520602
> Received: from [(10.110.4.1)] by GTW-DMZ with Xeams SMTP; Sat, 15 Aug 2015 01:53:57 +0200 (CEST)
> X-SM_EnvelopeFrom: manojm.321@gmail.com
> X-SM_RECEIVED_ON: Sat, 15 Aug 2015 01:53:57 +0200 (CEST)
> X-SMScore: -970
> X-LCID: 9520600
> Received: from [(140.211.11.3)] by GTW-DMZ with Xeams SMTP; Sat, 15 Aug 2015 01:53:45 +0200 (CEST)
> X-SM_EnvelopeFrom: user-return-20396-bigdatagroup=itecons.it@hadoop.apache.org
> X-SM_RECEIVED_ON: Sat, 15 Aug 2015 01:53:45 +0200 (CEST)
> Received: (qmail 45474 invoked by uid 500); 14 Aug 2015 23:53:40 -0000
> Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
> Precedence: bulk
> List-Help: <ma...@hadoop.apache.org>
> List-Unsubscribe: <ma...@hadoop.apache.org>
> List-Post: <ma...@hadoop.apache.org>
> List-Id: <user.hadoop.apache.org>
> Reply-To: <us...@hadoop.apache.org>
> Delivered-To: mailing list user@hadoop.apache.org
> Received: (qmail 45460 invoked by uid 99); 14 Aug 2015 23:53:40 -0000
> Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142)
>     by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Aug 2015 23:53:40 +0000
> Received: from localhost (localhost [127.0.0.1])
> 	by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id B1120DDE37
> 	for <us...@hadoop.apache.org>; Fri, 14 Aug 2015 23:53:39 +0000 (UTC)
> X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org
> X-Spam-Flag: NO
> X-Spam-Score: 3.129
> X-Spam-Level: ***
> X-Spam-Status: No, score=3.129 tagged_above=-999 required=6.31
> 	tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
> 	FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=3,
> 	RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001]
> 	autolearn=disabled
> Authentication-Results: spamd1-us-west.apache.org (amavisd-new);
> 	dkim=pass (2048-bit key) header.d=gmail.com
> Received: from mx1-eu-west.apache.org ([10.40.0.8])
> 	by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024)
> 	with ESMTP id 7p9eDKlZHd2d for <us...@hadoop.apache.org>;
> 	Fri, 14 Aug 2015 23:53:37 +0000 (UTC)
> Received: from mail-oi0-f50.google.com (mail-oi0-f50.google.com [209.85.218.50])
> 	by mx1-eu-west.apache.org (ASF Mail Server at mx1-eu-west.apache.org) with ESMTPS id D9CA42136C
> 	for <us...@hadoop.apache.org>; Fri, 14 Aug 2015 23:53:36 +0000 (UTC)
> Received: by oip136 with SMTP id 136so52683028oip.1
>         for <us...@hadoop.apache.org>; Fri, 14 Aug 2015 16:53:35 -0700 (PDT)
> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
>         d=gmail.com; s=20120113;
>         h=mime-version:date:message-id:subject:from:to:content-type;
>         bh=T1wf4bljaGb/zvLHrMaN7Q+F20Hqif18v22xBCeBLns=;
>         b=qy92Y3pzKefMnhUQIZnh4hS1+n8pN7c0RomzeWyzQZnTDroUk76CvZyxBt0nb+9YNb
>          oPMHKbQLWUvU+qE5N6tBJXu8uPEE0Rzju7n0XJ1AhgAO409atHt5lJsh9X0yz1CU3szK
>          oP70vwr33UlObl10O4lqBnFrVFAX9cK44zh3jYKxvO1gRxk4g5XnW2swmeDrldYf0eR4
>          eibUuU9H1j03RiTggrhVFOhuqs4zVxEIcn7KYDIXxtlkaq3RZMlRIAtg7e/aRttQcwbg
>          f3CuCa/zKtJTEKHCCI+3HQkneVeMHVcwe86UTl/jTDZ5sL0m7rJWSZdyLumhqDDaHmJ0
>          AhKA==
> MIME-Version: 1.0
> X-Received: by 10.202.48.200 with SMTP id w191mr13116197oiw.13.1439596415743;
>  Fri, 14 Aug 2015 16:53:35 -0700 (PDT)
> Received: by 10.182.22.170 with HTTP; Fri, 14 Aug 2015 16:53:35 -0700 (PDT)
> Date: Fri, 14 Aug 2015 16:53:35 -0700
> Message-ID: <CA...@mail.gmail.com>
> Subject: Map tasks keep Running even after the node is killed on Apache Yarn.
> From: manoj <ma...@gmail.com>
> To: <us...@hadoop.apache.org>
> Content-Type: multipart/alternative; boundary="001a113cd878fafe24051d4e28bf"
> Return-Path: manojm.321@gmail.com
>
>
> Final-Recipient: rfc822;bigdatagroup@itecons.it
> Action: failed
> Status: 5.2.2
> Diagnostic-Code: smtp;554 5.2.2 mailbox full;
> STOREDRV.Deliver.Exception:QuotaExceededException.MapiExceptionShutoffQuotaExceeded;
> Failed to process message due to a permanent exception with message Non
> riesco ad aprire la cassetta postale /o=Scai SpA/ou=Exchange Administrative
> Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=MAIL2/cn=Microsoft
> System Attendant. 16.55847:7E000000,
> 17.43559:0000000048010000000000002E00000000000000,
> 20.52176:000FE3832000000000000000, 20.50032:000FE3839017000000000000,
> 255.23226:000FE383, 255.27962:FE000000, 255.17082:DD040000,
> 0.26937:00000000, 4.21921:DD040000, 255.27962:FA000000, 255.1494:00000000,
> 0.50608:00000000,
> 5.29818:0000000034666631356633392D393430392D346230382D386264342D37666561303430643232623900000000,
> 1.29920:03000000, 7.29828:0A87B90E0000000000000000,
> 7.29832:0054B80E0000000000000000, 4.45884:DD040000, 4.29880:DD040000,
> 4.29888:DD040000, 1.56872:FE000000, 4.42712:DD040000,
> 5.10786:0000000031352E30302E303939352E3033323A6D61696C320000
>  0000, 255.1750:000FE383, 0.26849:00000000, 255.21817:DD040000,
> 0.26297:0A000000, 4.16585:DD040000, 0.32441:00000000, 4.1706:DD040000,
> 0.24761:0A000000, 4.20665:DD040000, 0.25785:00000000, 4.29881:DD040000
> [Stage: CreateSession]
> Remote-MTA: dns;MAIL2.scai.intra
>
>
>
> ---------- Forwarded message ----------
> From: manoj <ma...@gmail.com>
> To: <us...@hadoop.apache.org>
> Cc:
> Date: Fri, 14 Aug 2015 16:53:35 -0700
> Subject: Map tasks keep Running even after the node is killed on Apache
> Yarn.
> Hi,
>
> I'm on Apache2.6.0 YARN and I'm trying to test the dynamic addition and
> removal of nodes from the Cluster.
>
> The test starts a Job with 2 nodes and while the Job is progressing, It
> removes one of the node* by killing the dataNode and NodeManager Daemons.(
> is it ok to remove a node like this? )
>
> *this node is not running ResourceManager/ApplicationMaster for sure.
>
> After the node is successfully removed( I can confirm this from resource
> manager logs- attached) the test adds it back and waits till the job
> completes.
>
> Node Removal Logs:
>
> 2015-08-14 11:15:56,902 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:host172:36158 Timed out after 60 secs
> 2015-08-14 11:15:56,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node host172:36158 as it is now LOST
> 2015-08-14 11:15:56,904 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:36158 Node Transitioned from RUNNING to LOST
> 2015-08-14 11:15:56,905 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000006 Container Transitioned from RUNNING to KILLED
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000006 in state: KILLED event:KILL
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1439575616861_0001    CONTAINERID=container_1439575616861_0001_01_000006
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000006 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:1024, vCores:7> available, release resources=true
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:3584, vCores:3> numContainers=3 user=hadoop user-resources=<memory:3584, vCores:3>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000006, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3 cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.75 absoluteUsedCapacity=1.75 used=<memory:3584, vCores:3> cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000006 on node: host: host172:36158 #containers=1 available=1024 used=1024 with event: KILL
> 2015-08-14 11:15:56,907 INFO   org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000005 Container Transitioned from RUNNING to KILLED
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000005 in state: KILLED event:KILL
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1439575616861_0001    CONTAINERID=container_1439575616861_0001_01_000005
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000005 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 0 containers, <memory:0, vCores:0> used and <memory:2048, vCores:8> available, release resources=true
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:2560, vCores:2> numContainers=2 user=hadoop user-resources=<memory:2560, vCores:2>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000005, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2 cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.25 absoluteUsedCapacity=1.25 used=<memory:2560, vCores:2> cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000005 on node: host: host172:36158 #containers=0 available=2048 used=0 with event: KILL
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node host172:36158 clusterResource: <memory:2048, vCores:8>
>
> Node Addition logs:
>
> 2015-08-14 11:19:43,529 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved host172 to /default-rack
> 2015-08-14 11:19:43,530 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node host172(cmPort: 59426 httpPort: 8042) registered with capability: <memory:2048, vCores:8>, assigned nodeId host172:59426
> 2015-08-14 11:19:43,533 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:59426 Node Transitioned from NEW to RUNNING
> 2015-08-14 11:19:43,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Added node host172:59426 clusterResource: <memory:4096, vCores:16>
>
> *Here's the problem:*
>
> The Job never completes! According to the logs the mapTasks which were
> scheduled on the node that was removed are still "RUNNING" with a
> mapProgress of 100%. These tasks stays in the same state forever.
>
> In the AppMasterContainer logs I see that it continuously tries to connect
> to the previous node host172/XX.XX.XX.XX:36158 though it was removed and
> added on a different port host172/XX.XX.XX.XX:59426
>
> ......
> ...
> 2015-08-14 11:25:21,662 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: host172/XX.XX.XX.XX:36158. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> ...
> ...
>
> Please let me know if you need to see any more logs.
>
> P.S: The Jobs completes normally without dynamic addition and removal of
> nodes on the same Cluster with same memory settings.
> Thanks,
> --Manoj Kumar M
>
>


-- 
--Manoj Kumar M

Re: Non recapitabile: Map tasks keep Running even after the node is killed on Apache Yarn.

Posted by manoj <ma...@gmail.com>.
Looks like the App master tries to connect to the Node that was removed for
~30Min before it gives up.
Can I reduce this wait time and number of tries?


Thanks
-Manoj



-- 
--Manoj Kumar M