Posted to issues@flink.apache.org by "Roman Khachatryan (Jira)" <ji...@apache.org> on 2020/05/26 20:12:00 UTC

[jira] [Comment Edited] (FLINK-17933) TaskManager was terminated on Yarn - investigate

    [ https://issues.apache.org/jira/browse/FLINK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117020#comment-17117020 ] 

Roman Khachatryan edited comment on FLINK-17933 at 5/26/20, 8:11 PM:
---------------------------------------------------------------------

One of the reasons is disk space on the TM hosts (usage above the NodeManager threshold):
{code:java}
2020-05-26 19:53:22,779 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection (DiskHealthMonitor-Timer): Directory /mnt/yarn error, used space above threshold of 90.0%, removing from list of valid directories
2020-05-26 19:53:22,910 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection (DiskHealthMonitor-Timer): Directory /var/log/hadoop-yarn/containers error, used space above threshold of 90.0%, removing from list of valid directories
2020-05-26 19:53:22,910 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService (DiskHealthMonitor-Timer): Disk(s) failed: 1/4 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
2020-05-26 19:53:22,910 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService (DiskHealthMonitor-Timer): Most of the disks failed. 1/4 local-dirs are bad: /mnt/yarn; 1/1 log-dirs are bad: /var/log/hadoop-yarn/containers
2020-05-26 19:53:23,135 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService (AsyncDispatcher event handler):Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
2020-05-26 19:53:23,165 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl (AsyncDispatcher event handler): Container container_1589922255142_0164_01_000002 transitioned from RUNNING to KILLING
2020-05-26 19:53:23,165 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch (AsyncDispatcher event handler): Cleaning up container container_1589922255142_0164_01_000002
2020-05-26 19:53:23,185 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor (ContainersLauncher #231): Exit code from container container_1589922255142_0164_01_000002 is : 143{code}
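
For reference, exit code 143 corresponds to 128 + 15 (SIGTERM), i.e. the container was killed rather than exiting on its own, and the 90.0% figure is the NodeManager disk health checker's utilization threshold (configurable via yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage). Below is a minimal sketch of the kind of check the health monitor performs; it is not the actual NodeManager code, and the path and threshold are just illustrative defaults:
{code:java}
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DiskUsageCheck {

    public static void main(String[] args) throws IOException {
        // Illustrative values: YARN's default utilization threshold is 90.0%.
        String dir = args.length > 0 ? args[0] : "/mnt/yarn";
        double threshold = 90.0;

        FileStore store = Files.getFileStore(Paths.get(dir));
        double usedPct =
                100.0 * (store.getTotalSpace() - store.getUsableSpace()) / store.getTotalSpace();

        System.out.printf("%s used space: %.1f%%%n", dir, usedPct);
        if (usedPct > threshold) {
            // Under this condition the NodeManager removes the directory
            // from its list of valid directories (as in the log above).
            System.out.printf("Used space above threshold of %.1f%%%n", threshold);
        }
    }
}{code}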


> TaskManager was terminated on Yarn - investigate
> ------------------------------------------------
>
>                 Key: FLINK-17933
>                 URL: https://issues.apache.org/jira/browse/FLINK-17933
>             Project: Flink
>          Issue Type: Task
>          Components: Deployment / YARN, Runtime / Task
>    Affects Versions: 1.11.0
>            Reporter: Roman Khachatryan
>            Assignee: Roman Khachatryan
>            Priority: Major
>             Fix For: 1.11.0
>
>
> When running a job on a YARN cluster (load testing), some jobs fail.
> Initial symptoms are no bytes written/transferred in CSV and failures in the logs: 
> {code:java}
> 2020-05-17 10:02:32,858 WARN org.apache.flink.runtime.taskmanager.Task [] - Map -> Flat Map (138/160) (e49f7ea26b633c8035f2a919b1c580c8) switched from RUNNING to FAILED.{code}
>  
> It turned out that all such failures were caused by "Connection reset" from a single IP, except for one "Leadership lost" error (another IP).
> The connection reset was likely caused by the TM receiving SIGTERM (containers container_1589453804748_0118_01_000004 and 000005, both on ip-172-31-42-229):
> {code:java}
> 2020-05-17 10:02:31,362 INFO org.apache.flink.yarn.YarnTaskExecutorRunner [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.{code}
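>  
> The "RECEIVED SIGNAL 15" line means the TM process was terminated externally (e.g. by the YARN NodeManager killing the container). As a minimal illustration of how a JVM process can log such a request before shutting down (this is not Flink's actual SignalHandler; a plain shutdown hook stands in for the real signal handling):
> {code:java}
> public class SigtermLoggingSketch {
>     public static void main(String[] args) throws InterruptedException {
>         // A shutdown hook runs when the JVM receives SIGTERM (kill -15),
>         // which is what the YARN NodeManager sends when it kills a container.
>         Runtime.getRuntime().addShutdownHook(new Thread(() ->
>                 System.out.println("RECEIVED SIGTERM. Shutting down as requested.")));
>         // Keep the process alive so the signal can be sent from outside.
>         Thread.sleep(Long.MAX_VALUE);
>     }
> }{code}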
>  
> Other TMs received SIGTERM one minute later (all logs were uploaded at the same time though).
>  
> From the JM it looked like this:
> {code:java}
> 2020-05-17 10:02:23,583 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Trigger heartbeat request.
> 2020-05-17 10:02:23,587 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000005.
> 2020-05-17 10:02:23,590 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000006.
> 2020-05-17 10:02:23,592 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000004.
> 2020-05-17 10:02:23,595 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000003.
> 2020-05-17 10:02:23,598 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000002.
> 2020-05-17 10:02:23,725 DEBUG org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received acknowledge message for checkpoint 12 from task 459efd2ad8fe2ffe7fffe28530064fe1 of job 5d4d8c88de23b1361fe0dce6ba8443f8 at container_1589453804748_0118_01_000002 @ ip-172-31-43-69.eu-central-1.compute.internal (dataPort=44625).
> 2020-05-17 10:02:29,103 DEBUG org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received acknowledge message for checkpoint 12 from task 266a9326be7e3ec669cce2e6a97ae5b0 of job 5d4d8c88de23b1361fe0dce6ba8443f8 at container_1589453804748_0118_01_000005 @ ip-172-31-42-229.eu-central-1.compute.internal (dataPort=37329).
> 2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@ip-172-31-42-229.eu-central-1.compute.internal:39999] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@ip-172-31-42-229.eu-central-1.compute.internal:42567] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2020-05-17 10:02:32,900 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Map -> Flat Map (87/160) (cb77c7002503baa74baf73a3a100c2f2) switched from RUNNING to FAILED.
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection reset by peer (connection to 'ip-172-31-42-229.eu-central-1.compute.internal/172.31.42.229:37329'){code}
>  
> There are also JobManager heartbeat timeouts, but they don't correlate with the issue.


