Posted to common-user@hadoop.apache.org by manoj <ma...@gmail.com> on 2015/08/18 01:59:31 UTC

Re: Undeliverable: Map tasks keep Running even after the node is killed on Apache Yarn.

Looks like the ApplicationMaster keeps trying to connect to the node that was
removed for ~30 minutes before it gives up.
Can I reduce this wait time and the number of retries?
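
From the AM log, each retry cycle looks like maxRetries=10 with a 1-second
sleep, so roughly 10 seconds of connect attempts per cycle; something above
that is apparently restarting those cycles for the ~30 minutes. The knobs I've
been looking at so far are below (I'm not certain these are the right ones for
the AM's container launcher, and the values are only examples):

yarn-site.xml (I think the MR ApplicationMaster reads these too):

  <property>
    <!-- total time to keep trying to reach a NodeManager -->
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>60000</value>
  </property>
  <property>
    <!-- pause between NodeManager connection attempts -->
    <name>yarn.client.nodemanager-connect.retry-interval-ms</name>
    <value>10000</value>
  </property>

core-site.xml (low-level IPC connect retries underneath each attempt):

  <property>
    <name>ipc.client.connect.max.retries.on.timeouts</name>
    <value>5</value>
  </property>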


Thanks
-Manoj

On Fri, Aug 14, 2015 at 4:54 PM, <po...@scai.it> wrote:

> *Delivery has failed to these recipients or groups:*
>
> bigdatagroup@itecons.it
> The recipient's mailbox is full and can't accept messages right now. Please
> try sending the message again later.
>
> Your message was rejected by the following organization:
> MAIL2.scai.intra.
>
>
> ---------- Forwarded message ----------
> From: manoj <ma...@gmail.com>
> To: <us...@hadoop.apache.org>
> Cc:
> Date: Fri, 14 Aug 2015 16:53:35 -0700
> Subject: Map tasks keep Running even after the node is killed on Apache
> Yarn.
> Hi,
>
> I'm on Apache Hadoop 2.6.0 YARN and I'm trying to test dynamic addition and
> removal of nodes from the cluster.
>
> The test starts a job on 2 nodes and, while the job is progressing, removes
> one of the nodes* by killing its DataNode and NodeManager daemons. (Is it OK
> to remove a node like this? A sketch of the exclude-file alternative follows
> below.)
>
> *This node is definitely not running the ResourceManager or the ApplicationMaster.
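>
> (The exclude-file decommission I'm comparing against is roughly this; the
> exclude file locations are just placeholders for whatever the properties
> point to in a given setup.)
>
> yarn-site.xml:
>   <property>
>     <name>yarn.resourcemanager.nodes.exclude-path</name>
>     <value>/etc/hadoop/conf/yarn.exclude</value>  <!-- placeholder path -->
>   </property>
>
> hdfs-site.xml:
>   <property>
>     <name>dfs.hosts.exclude</name>
>     <value>/etc/hadoop/conf/dfs.exclude</value>   <!-- placeholder path -->
>   </property>
>
> After adding host172 to both files, "yarn rmadmin -refreshNodes" and
> "hdfs dfsadmin -refreshNodes" tell the ResourceManager and NameNode to stop
> using the node. In this test, though, I kill the daemons directly.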
>
> After the node is successfully removed (I can confirm this from the
> ResourceManager logs, attached below), the test adds it back and waits until
> the job completes.
>
> Node Removal Logs:
>
> 2015-08-14 11:15:56,902 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:host172:36158 Timed out after 60 secs
> 2015-08-14 11:15:56,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node host172:36158 as it is now LOST
> 2015-08-14 11:15:56,904 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:36158 Node Transitioned from RUNNING to LOST
> 2015-08-14 11:15:56,905 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000006 Container Transitioned from RUNNING to KILLED
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000006 in state: KILLED event:KILL
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1439575616861_0001    CONTAINERID=container_1439575616861_0001_01_000006
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000006 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:1024, vCores:7> available, release resources=true
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:3584, vCores:3> numContainers=3 user=hadoop user-resources=<memory:3584, vCores:3>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000006, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3 cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.75 absoluteUsedCapacity=1.75 used=<memory:3584, vCores:3> cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3
> 2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000006 on node: host: host172:36158 #containers=1 available=1024 used=1024 with event: KILL
> 2015-08-14 11:15:56,907 INFO   org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000005 Container Transitioned from RUNNING to KILLED
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000005 in state: KILLED event:KILL
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1439575616861_0001    CONTAINERID=container_1439575616861_0001_01_000005
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000005 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 0 containers, <memory:0, vCores:0> used and <memory:2048, vCores:8> available, release resources=true
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:2560, vCores:2> numContainers=2 user=hadoop user-resources=<memory:2560, vCores:2>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000005, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2 cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.25 absoluteUsedCapacity=1.25 used=<memory:2560, vCores:2> cluster=<memory:2048, vCores:8>
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000005 on node: host: host172:36158 #containers=0 available=2048 used=0 with event: KILL
> 2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node host172:36158 clusterResource: <memory:2048, vCores:8>
>
> Node Addition logs:
>
> 2015-08-14 11:19:43,529 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved host172 to /default-rack
> 2015-08-14 11:19:43,530 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node host172(cmPort: 59426 httpPort: 8042) registered with capability: <memory:2048, vCores:8>, assigned nodeId host172:59426
> 2015-08-14 11:19:43,533 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:59426 Node Transitioned from NEW to RUNNING
> 2015-08-14 11:19:43,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Added node host172:59426 clusterResource: <memory:4096, vCores:16>
>
> *Here's the problem:*
>
> The job never completes! According to the logs, the map tasks that were
> scheduled on the removed node are still "RUNNING" with a map progress of
> 100%. These tasks stay in that state forever.
>
> In the ApplicationMaster container logs I see that it continuously tries to
> connect to the node's previous address, host172/XX.XX.XX.XX:36158, even
> though the node was removed and re-added on a different port,
> host172/XX.XX.XX.XX:59426:
>
> ......
> ...
> 2015-08-14 11:25:21,662 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: host172/XX.XX.XX.XX:36158. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> ...
> ...
>
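> If I understand the CLI right, I could probably unstick things by failing
> the stuck attempts by hand ("mapred job -fail-task <task-attempt-id>") so
> the AM reschedules them, but I'd rather understand why they aren't failed
> automatically once the node goes LOST.
>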
> Please let me know if you need to see any more logs.
>
> P.S.: The job completes normally on the same cluster with the same memory
> settings when nodes are not dynamically added and removed.
> Thanks,
> --Manoj Kumar M
>
>


-- 
--Manoj Kumar M