Posted to issues@flink.apache.org by "Guowei Ma (JIRA)" <ji...@apache.org> on 2019/04/04 01:31:00 UTC

[jira] [Commented] (FLINK-12106) Jobmanager is killing FINISHED taskmanager containers, causing exception in still running Taskmanagers an

    [ https://issues.apache.org/jira/browse/FLINK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809421#comment-16809421 ] 

Guowei Ma commented on FLINK-12106:
-----------------------------------

AFAIK, the community is working on it.  [FLINK-10941|https://issues.apache.org/jira/browse/FLINK-10941] has the same problem. 

This issue is related to the lifecycle control of the shuffle resource. There have been some related discussions and designs [1][2].

[1] [https://docs.google.com/document/d/13vAJJxfRXAwI4MtO8dux8hHnNMw2Biu5XRrb_hvGehA/edit#heading=h.v7vhb7w01d61]

[2] [https://cwiki.apache.org/confluence/display/FLINK/FLIP-31%3A+Pluggable+Shuffle+Manager]
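
A possible stop-gap until that work lands (untested on my side, so just a sketch): lengthen the two idle timeouts so finished TaskManagers are kept around until the whole job has stopped consuming their shuffle data. The timings in the log look consistent with the defaults of slot.idle.timeout and resourcemanager.taskmanager-timeout (both existing Flink 1.7 options, in milliseconds); the values below are placeholders and would need tuning per job:

    # flink-conf.yaml -- illustrative values only
    # How long the SlotPool keeps an idle slot before releasing it.
    slot.idle.timeout: 3600000
    # How long the ResourceManager keeps an idle TaskExecutor/container
    # before stopping it ("TaskExecutor exceeded the idle timeout").
    resourcemanager.taskmanager-timeout: 3600000

The same properties should also be settable per job via YARN dynamic properties, e.g. "flink run -m yarn-cluster -yD slot.idle.timeout=3600000 -yD resourcemanager.taskmanager-timeout=3600000 ...". Note this only delays the container shutdown; the real fix is the shuffle resource lifecycle work referenced above.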

 

> Jobmanager is killing FINISHED taskmanager containers, causing exception in still running Taskmanagers an
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12106
>                 URL: https://issues.apache.org/jira/browse/FLINK-12106
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.7.2
>         Environment: Hadoop:  hdp/2.5.6.0-40
> Flink: 1.7.2
>            Reporter: John
>            Priority: Major
>
> When running a single Flink job on YARN, some of the taskmanager containers reach the FINISHED state before others. It appears that, after receiving final execution state FINISHED from a taskmanager, the jobmanager waits ~68 seconds and then frees the associated slot in that taskmanager. After an additional 60 seconds, the jobmanager stops the same taskmanager because the TaskExecutor exceeded the idle timeout.
> Meanwhile, other taskmanagers are still working to complete the job. Within 10 seconds after the taskmanager container above is stopped, the remaining taskmanagers receive an exception due to loss of connection to the stopped taskmanager. These exceptions result in job failure.
>  
> Relevant logs:
> 2019-04-03 13:49:00,013 INFO  org.apache.flink.yarn.YarnResourceManager                     - Registering TaskManager with ResourceID container_1553017480503_0158_01_000038 (akka.tcp://flink@hadoop4:42745/user/taskmanager_0) at ResourceManager
> 2019-04-03 13:49:05,900 INFO  org.apache.flink.yarn.YarnResourceManager                     - Registering TaskManager with ResourceID container_1553017480503_0158_01_000059 (akka.tcp://flink@hadoop9:55042/user/taskmanager_0) at ResourceManager
>  
>  
> 2019-04-03 13:48:51,132 INFO  org.apache.flink.yarn.YarnResourceManager                     - Received new container: container_1553017480503_0158_01_000077 - Remaining pending container requests: 6
> 2019-04-03 13:48:52,862 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -Dlog.file=/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000077/taskmanager.log
> 2019-04-03 13:48:57,490 INFO  org.apache.flink.runtime.io.network.netty.NettyServer         - Successful initialization (took 202 ms). Listening on SocketAddress /192.168.230.69:40140.
> 2019-04-03 13:49:12,575 INFO  org.apache.flink.yarn.YarnResourceManager                     - Registering TaskManager with ResourceID container_1553017480503_0158_01_000077 (akka.tcp://flink@hadoop9:51525/user/taskmanager_0) at ResourceManager
> 2019-04-03 13:49:12,631 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Allocated slot for AllocationID\{42fed3e5a136240c23cc7b394e3249e9}.
> 2019-04-03 14:58:15,188 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Un-registering task and sending final execution state FINISHED to JobManager for task DataSink (com.anovadata.alexflinklib.sinks.bucketing.BucketingOutputFormat@26874f2c) a4b5fb32830d4561147b2714828109e2.
> 2019-04-03 14:59:23,049 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Releasing idle slot [AllocationID\{42fed3e5a136240c23cc7b394e3249e9}].
> 2019-04-03 14:59:23,058 INFO  org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable      - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile\{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647}, allocationId: AllocationID\{42fed3e5a136240c23cc7b394e3249e9}, jobId: a6c4e367698c15cdf168d19a89faff1d).
> 2019-04-03 15:00:02,641 INFO  org.apache.flink.yarn.YarnResourceManager                     - Stopping container container_1553017480503_0158_01_000077.
> 2019-04-03 15:00:02,646 INFO  org.apache.flink.yarn.YarnResourceManager                     - Closing TaskExecutor connection container_1553017480503_0158_01_000077 because: TaskExecutor exceeded the idle timeout.
>  
>  
> 2019-04-03 13:48:48,902 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -Dlog.file=/data1/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000059/taskmanager.log
> 2019-04-03 14:59:24,677 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem columnStore to file. allocated memory: 109479981
> 2019-04-03 15:00:05,696 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter         - mem size 135014409 > 134217728: flushing 1930100 records to disk.
> 2019-04-03 15:00:05,696 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem columnStore to file. allocated memory: 102677684
> 2019-04-03 15:00:08,671 ERROR org.apache.flink.runtime.operators.BatchTask                  - Error in task code:  CHAIN Partition -> FlatMap 
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This indicates that the remote task manager was lost.
> 2019-04-03 15:00:08,714 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Un-registering task and sending final execution state FAILED to JobManager for task CHAIN Partition -> FlatMap
> 2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task DataSink ()
> 2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSink () switched from RUNNING to CANCELING.
> 2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code DataSink ()
>  
>  
> 2019-04-03 13:48:44,562 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  -     -Dlog.file=/data8/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000038/taskmanager.log
> 2019-04-03 14:59:18,620 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem columnStore to file. allocated memory: 0
> 2019-04-03 14:59:48,088 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter         - mem size 136179972 > 134217728: flushing 1930100 records to disk.
> 2019-04-03 14:59:48,088 INFO  org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem columnStore to file. allocated memory: 103333893
> 2019-04-03 15:00:08,692 ERROR org.apache.flink.runtime.operators.BatchTask                  - Error in task code:  CHAIN Partition -> FlatMap
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This indicates that the remote task manager was lost.
> 2019-04-03 15:00:08,741 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Un-registering task and sending final execution state FAILED to JobManager for task CHAIN Partition -> FlatMap
> 2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task DataSink ()
> 2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSink () switched from RUNNING to CANCELING.
> 2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code DataSink ()
>  
>  
> 2019-04-03 15:00:09,196 INFO  org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job a6c4e367698c15cdf168d19a89faff1d reached globally terminal state FAILED.
>  
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)