You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2019/04/12 09:06:00 UTC
[jira] [Reopened] (FLINK-10941) Slots prematurely released which
still contain unconsumed data
[ https://issues.apache.org/jira/browse/FLINK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann reopened FLINK-10941:
-----------------------------------
Backport for 1.8.1 and 1.7.x still pending
> Slots prematurely released which still contain unconsumed data
> ---------------------------------------------------------------
>
> Key: FLINK-10941
> URL: https://issues.apache.org/jira/browse/FLINK-10941
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.5.5, 1.6.2, 1.7.0
> Reporter: Qi
> Assignee: Andrey Zagrebin
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Our case is: Flink 1.5 batch mode, 32 parallelism to read data source and 4 parallelism to write data sink.
>
> The read task worked perfectly with 32 TMs. However when the job was executing the write task, since only 4 TMs were needed, other 28 TMs were released. This caused RemoteTransportException in the write task:
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager ’the_previous_TM_used_by_read_task'. This might indicate that the remote task manager was lost.
> at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:133)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
> ...
>
> After skimming YarnFlinkResourceManager related code, it seems to me that Flink is releasing TMs when they’re idle, regardless of whether working TMs need them.
>
> Put in another way, Flink seems to prematurely release slots which contain unconsumed data and, thus, eventually release a TM which then fails a consuming task.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)