Posted to user@flink.apache.org by Juho Autio <ju...@rovio.com> on 2018/08/13 07:52:04 UTC

Re: 1.5.1

I also have jobs failing on a daily basis with the error "Heartbeat of
TaskManager with id <id> timed out". I'm using Flink 1.5.2.

Could anyone suggest how to debug possible causes?

I already set these in flink-conf.yaml, but I'm still getting failures:
heartbeat.interval: 10000
heartbeat.timeout: 100000
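
Both keys take milliseconds. Judging from the 1.5 configuration docs, the
defaults are heartbeat.interval: 10000 and heartbeat.timeout: 50000, so the
timeout above is already double the default. A minimal flink-conf.yaml sketch,
just to make the units explicit:

    # flink-conf.yaml -- both values are in milliseconds
    heartbeat.interval: 10000    # how often heartbeat requests are sent (default: 10 s)
    heartbeat.timeout: 100000    # fail after 100 s without an answer (default: 50 s)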

Thanks.

On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vi...@gmail.com>
wrote:

> According to the UI it seems that "
>
> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>
> " was the cause of a pipe restart.
>
> As to the TM, it is an artifact of the new job allocation regime, which will
> exhaust all slots on a TM rather than distributing them equitably. TMs are
> selectively under more stress than in a pure round-robin distribution, I
> think. We may have to lower the slots on each TM to define a good upper
> bound. You are correct that 50s is a pretty generous value.
>
> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <ga...@data-artisans.com> wrote:
>
>> Hi,
>>
>> The first exception should only be logged on info level. It's expected to
>> see this exception when a TaskManager unregisters from the ResourceManager.
>>
>> Heartbeats can be configured via heartbeat.interval and heartbeat.timeout
>> [1]. The default timeout is 50s, which should be a generous value. It is
>> probably a good idea to find out why the heartbeats cannot be answered by
>> the TM.
>>
>> Best,
>> Gary
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>
>>
>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <
>> vishal.santoshi@gmail.com> wrote:
>>
>>> Two issues we are seeing on 1.5.1 on a streaming pipeline:
>>>
>>> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>
>>>
>>> and
>>>
>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>
>>>
>>> Not sure about the first, but how do we increase the heartbeat interval
>>> of a TM?
>>>
>>> Thanks much
>>>
>>> Vishal
>>>
>>
>>
>

Re: 1.5.1

Posted by Juho Autio <ju...@rovio.com>.
Gary, I found another mail thread about a similar issue:

http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Testing-on-Flink-1-5-tp19565p19647.html

Specifically I found this:

> we are observing Akka.ask.timeout errors for a few of our jobs (JM's
> logs[2]); we tried to increase this parameter to 2 min from the default of
> 10 s and still observe the same.

And your answer:

> Can you try setting:
>
>     web.timeout: 120000
>
> This will increase the timeout of the RPC call to 2 minutes.

Do you think that might fix my problem? Does Flink job execution internally
depend on the Flink REST API answering quickly? Due to monitoring, we're
making quite a lot of calls to the Flink REST API (from outside the cluster),
which could be blocking other requests.

Looks like the default is just 10000 according to
https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#web-timeout
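
In other words, the suggested change boils down to a single line in
flink-conf.yaml (the value is in milliseconds, per the config page above);
whether it actually helps with the heartbeat timeouts is exactly what I'm
unsure about:

    web.timeout: 120000    # REST/web RPC timeout; default 10000 (10 s), 120000 = 2 min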

Cheers,
Juho

Re: 1.5.1

Posted by Juho Autio <ju...@rovio.com>.
Vishal, from which version did you upgrade to 1.5.1? Maybe from 1.5.0
(release)? Knowing that might help narrow down the source of this.

Re: 1.5.1

Posted by Juho Autio <ju...@rovio.com>.
For the record, this hasn't been a problem for us any more; we're now
successfully running Flink 1.6.0.

We set "web.timeout: 120000" in flink-conf.yaml, but so far I've gathered
that this setting doesn't have anything to do with heartbeat timeouts (?).
Most likely the heartbeat timeouts were caused by some temporary network-level
issues on our clusters.

Thanks for all the comments.

Re: 1.5.1

Posted by Gary Yao <ga...@data-artisans.com>.
Hi Juho,

the main thread of the RPC endpoint should never be blocked. Blocking on that
thread is considered an implementation error. Unfortunately, without logs it
is difficult to tell what the exact problem is. If you are able to reproduce
heartbeat timeouts in your test/staging environment, can you send the full
logs of the ClusterEntryPoint (JobManager) and those of the failing
TaskManager? From the log extracts you sent, I cannot really draw any
conclusions.

Best,
Gary


Re: 1.5.1

Posted by Juho Autio <ju...@rovio.com>.
Thanks Gary..

What could be blocking the RPC threads? Slow checkpointing?

In production we're still using a self-built Flink package 1.5-SNAPSHOT,
flink commit 8395508b0401353ed07375e22882e7581d46ac0e, and the jobs are
stable.

Now with 1.5.2 the same jobs are failing due to heartbeat timeouts every
day. What changed between commit 8395508b0401353ed07375e22882e7581d46ac0e &
release 1.5.2?

Also, I just tried to run a slightly heavier job. It eventually had some
heartbeat timeouts, and then this:

2018-08-15 01:49:58,156 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
Kafka (topic1, topic2) -> Filter -> AppIdFilter([topic1, topic2]) ->
XFilter -> EventMapFilter(AppFilters) (4/8)
(da6e2ba425fb91316dd05e72e6518b24) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot
container_1534167926397_0001_01_000002_1 was removed.

After that, the job tried to restart according to the Flink restart strategy,
but that kept failing with this error:

2018-08-15 02:00:22,000 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job X
(19bd504d2480ccb2b44d84fb1ef8af68) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms. Slots
required: 36, slots allocated: 12

This was repeated until all restart attempts had been used (we've set it to
50), and then the job finally failed.
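
For context, the restart limit mentioned above corresponds to a fixed-delay
restart strategy along these lines in flink-conf.yaml (the delay value here
is only illustrative):

    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 50
    restart-strategy.fixed-delay.delay: 10 s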

I would also like to know how to prevent Flink from going into such a bad
state. At least it should exit immediately instead of retrying in such a
situation. And why was the "slot container removed"?

Re: 1.5.1

Posted by Gary Yao <ga...@data-artisans.com>.
Hi Juho,

It seems in your case the JobMaster did not receive a heartbeat from the
TaskManager in time [1]. Heartbeat requests and answers are sent over the RPC
framework, and the RPCs of one component (e.g., TaskManager, JobMaster, etc.)
are dispatched by a single thread. Therefore, the reasons for heartbeat
timeouts include:

    1. The RPC threads of the TM or JM are blocked. In this case heartbeat
requests or answers cannot be dispatched.
    2. The scheduled task for sending the heartbeat requests [2] died.
    3. The network is flaky.

If you are confident that the network is not the culprit, I would suggest
setting the logging level to DEBUG and looking for the periodic log messages
(in both JM and TM logs) that are related to heartbeating. If those periodic
log messages are overdue, it is a hint that the main thread of the RPC
endpoint is blocked somewhere.
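
If raising the root logger to DEBUG is too noisy, a narrower alternative is to
enable DEBUG only for the heartbeat-related loggers in conf/log4j.properties
on both the JM and the TM. This is only a sketch, assuming the default log4j
setup shipped with the Flink distribution; the package names are taken from
the source links in [1] and [2]:

    # conf/log4j.properties
    log4j.logger.org.apache.flink.runtime.heartbeat=DEBUG
    log4j.logger.org.apache.flink.runtime.jobmaster=DEBUG

If those periodic heartbeat messages stop appearing at the configured
heartbeat.interval, that is the "overdue" signal described above.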

Best,
Gary

[1]
https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1611
[2]
https://github.com/apache/flink/blob/913b0413882939c30da4ad4df0cabc84dfe69ea0/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java#L64
