Posted to user@flink.apache.org by Bruno Aranda <ba...@apache.org> on 2018/08/18 22:57:29 UTC

Job Manager killed by Kubernetes during recovery

Hi,

I am experiencing an issue when a job manager is trying to recover using an
HA setup. When the job manager starts again and tries to resume from the
last checkpoints, it gets killed by Kubernetes (I guess), since I can see
the following in the logs while the jobs are deployed:

INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

I am requesting enough memory for it, 3000Gi, and it is configured to use
2048Gb of memory. I have tried to increase the max perm size, but did not
see an improvement.

Any suggestions to help diagnose this?

I have the following:

Flink 1.6.0 (same with 1.5.1)
Azure AKS with Kubernetes 1.11
State management using RocksDB with checkpoints stored in Azure Data Lake

Thanks!

Bruno

Re: Job Manager killed by Kubernetes during recovery

Posted by Till Rohrmann <tr...@apache.org>.
Great to hear that you've resolved the problem and thanks for sharing the
solution. This will help others who might run into a similar problem.

Cheers,
Till

On Wed, Aug 22, 2018, 16:14 Bruno Aranda <ba...@apache.org> wrote:

> Actually, I have found the issue. It was a simple thing, really, once you
> know it, of course.
>
> It was caused by the livenessProbe kicking in too early. For a Flink
> cluster with several jobs, the default 30 seconds I was using (taken from
> the Flink Helm chart in the examples) was not enough to let the job manager
> fully recover and start. Increasing that value fixes the issue.
>
> I ended up with a job manager with a 4000Gi limit, 3000Gi requested, and
> configured to use 2048Gb. So I guess the memory was a red herring for me.
>
> I managed to see what was going on by using the kubectl "describe" action,
> where the probe failure was clearly indicated as an event.
>
> Thanks Vino and Till for your time!
>
> Bruno
>
> On Tue, 21 Aug 2018 at 10:21 Till Rohrmann <tr...@apache.org> wrote:
>
>> Hi Bruno,
>>
>> In order to debug this problem, we would need a bit more information. In
>> particular, the logs of the cluster entrypoint and your K8s deployment
>> specification would be helpful. If you have some memory limits specified,
>> these would also be interesting to know.
>>
>> Cheers,
>> Till
>>
>> On Sun, Aug 19, 2018 at 2:43 PM vino yang <ya...@gmail.com> wrote:
>>
>>> Hi Bruno,
>>>
>>> Pinging Till for you; he may be able to give you some useful information.
>>>
>>> Thanks, vino.
>>>
>>> On Sun, 19 Aug 2018 at 06:57, Bruno Aranda <ba...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am experiencing an issue when a job manager is trying to recover
>>>> using an HA setup. When the job manager starts again and tries to resume
>>>> from the last checkpoints, it gets killed by Kubernetes (I guess), since I
>>>> can see the following in the logs while the jobs are deployed:
>>>>
>>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>>> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>>
>>>> I am requesting enough memory for it, 3000Gi, and it is configured to
>>>> use 2048Gb of memory. I have tried to increase the max perm size, but did
>>>> not see an improvement.
>>>>
>>>> Any suggestions to help diagnose this?
>>>>
>>>> I have the following:
>>>>
>>>> Flink 1.6.0 (same with 1.5.1)
>>>> Azure AKS with Kubernetes 1.11
>>>> State management using RocksDB with checkpoints stored in Azure Data
>>>> Lake
>>>>
>>>> Thanks!
>>>>
>>>> Bruno
>>>>
>>>>

Re: Job Manager killed by Kubernetes during recovery

Posted by Bruno Aranda <ba...@apache.org>.
Actually, I have found the issue. It was a simple thing, really, once you
know it, of course.

It was caused by the livenessProbe kicking in too early. For a Flink
cluster with several jobs, the default 30 seconds I was using (taken from
the Flink Helm chart in the examples) was not enough to let the job manager
fully recover and start. Increasing that value fixes the issue.
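
For illustration, the probe section of the jobmanager deployment ends up
looking something like the sketch below. The /overview path, port 8081 and
the exact timings are only an example of the shape of the change, not values
copied from our spec:

  livenessProbe:
    httpGet:
      path: /overview           # illustrative; any cheap endpoint on the JobManager REST port
      port: 8081                # default Flink REST/web UI port
    initialDelaySeconds: 120    # raised from the 30s default in the example Helm chart
    periodSeconds: 30
    failureThreshold: 5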

I ended up with a job manager with a 4000Gi limit, 3000Gi requested, and
configured to use 2048Gb. So I guess the memory was a red herring for me.
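
In the deployment spec that is just the usual resources block, e.g. (memory
values written as above; treat them as placeholders for your own sizing):

  resources:
    requests:
      memory: "3000Gi"   # placeholder, as quoted in this thread
    limits:
      memory: "4000Gi"   # placeholder, as quoted in this thread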

I managed to see what was going on by using the kubectl "describe" action,
where the probe failure was clearly indicated as an event.
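
Concretely, that amounts to something like (pod name is just a placeholder):

  kubectl describe pod <jobmanager-pod-name>

and then checking the Events section at the bottom of the output for the
liveness probe failures and the resulting kill.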

Thanks Vino and Till for your time!

Bruno

On Tue, 21 Aug 2018 at 10:21 Till Rohrmann <tr...@apache.org> wrote:

> Hi Bruno,
>
> In order to debug this problem, we would need a bit more information. In
> particular, the logs of the cluster entrypoint and your K8s deployment
> specification would be helpful. If you have some memory limits specified,
> these would also be interesting to know.
>
> Cheers,
> Till
>
> On Sun, Aug 19, 2018 at 2:43 PM vino yang <ya...@gmail.com> wrote:
>
>> Hi Bruno,
>>
>> Pinging Till for you; he may be able to give you some useful information.
>>
>> Thanks, vino.
>>
>> On Sun, 19 Aug 2018 at 06:57, Bruno Aranda <ba...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> I am experiencing an issue when a job manager is trying to recover using
>>> an HA setup. When the job manager starts again and tries to resume from the
>>> last checkpoints, it gets killed by Kubernetes (I guess), since I can see
>>> the following in the logs while the jobs are deployed:
>>>
>>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>>> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>
>>> I am requesting enough memory for it, 3000Gi, and it is configured to
>>> use 2048Gb of memory. I have tried to increase the max perm size, but did
>>> not see an improvement.
>>>
>>> Any suggestions to help diagnose this?
>>>
>>> I have the following:
>>>
>>> Flink 1.6.0 (same with 1.5.1)
>>> Azure AKS with Kubernetes 1.11
>>> State management using RocksDB with checkpoints stored in Azure Data Lake
>>>
>>> Thanks!
>>>
>>> Bruno
>>>
>>>

Re: Job Manager killed by Kubernetes during recovery

Posted by Till Rohrmann <tr...@apache.org>.
Hi Bruno,

In order to debug this problem, we would need a bit more information. In
particular, the logs of the cluster entrypoint and your K8s deployment
specification would be helpful. If you have some memory limits specified,
these would also be interesting to know.

Cheers,
Till

On Sun, Aug 19, 2018 at 2:43 PM vino yang <ya...@gmail.com> wrote:

> Hi Bruno,
>
> Pinging Till for you; he may be able to give you some useful information.
>
> Thanks, vino.
>
> On Sun, 19 Aug 2018 at 06:57, Bruno Aranda <ba...@apache.org> wrote:
>
>> Hi,
>>
>> I am experiencing an issue when a job manager is trying to recover using
>> an HA setup. When the job manager starts again and tries to resume from the
>> last checkpoints, it gets killed by Kubernetes (I guess), since I can see
>> the following in the logs while the jobs are deployed:
>>
>> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
>> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>
>> I am requesting enough memory for it, 3000Gi, and it is configured to use
>> 2048Gb of memory. I have tried to increase the max perm size, but did not
>> see an improvement.
>>
>> Any suggestions to help diagnose this?
>>
>> I have the following:
>>
>> Flink 1.6.0 (same with 1.5.1)
>> Azure AKS with Kubernetes 1.11
>> State management using RocksDB with checkpoints stored in Azure Data Lake
>>
>> Thanks!
>>
>> Bruno
>>
>>

Re: Job Manager killed by Kubernetes during recovery

Posted by vino yang <ya...@gmail.com>.
Hi Bruno,

Pinging Till for you; he may be able to give you some useful information.

Thanks, vino.

On Sun, 19 Aug 2018 at 06:57, Bruno Aranda <ba...@apache.org> wrote:

> Hi,
>
> I am experiencing an issue when a job manager is trying to recover using
> an HA setup. When the job manager starts again and tries to resume from the
> last checkpoints, it gets killed by Kubernetes (I guess), since I can see
> the following in the logs while the jobs are deployed:
>
> INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -
> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>
> I am requesting enough memory for it, 3000Gi, and it is configured to use
> 2048Gb of memory. I have tried to increase the max perm size, but did not
> see an improvement.
>
> Any suggestions to help diagnose this?
>
> I have the following:
>
> Flink 1.6.0 (same with 1.5.1)
> Azure AKS with Kubernetes 1.11
> State management using RocksDB with checkpoints stored in Azure Data Lake
>
> Thanks!
>
> Bruno
>
>