You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Peter Schrott <pe...@bluerootlabs.io> on 2022/04/21 16:08:10 UTC

Jobmanager trying to be registered for Zombie Job

Hi Flink-Users,

I am not sure if this does something to my cluster or not. But since
updating to Flink 1.15 (atm rc4) I find the following logs:

INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762@akka.tcp://
flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
5566648d9b1aac6c1a1b78187fd56975.

as many times as number of parallelisms (here 10 times). These logs are
triggered every 5 minutes.

Then they are followed by:

INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
@akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
failed.

also 10 log entries.

I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975), it
was a long-running sql streaming job, started on Apr 13th on a standalone
cluster. After some recovery attempts it finally failed (using the failover
strategy) on the 20th Apr (yesterday) for good. Then those logs started to
appear. Now there was no other job running on my cluster anymore but the
logs appeared every 5 minutes until I restarted this jobmanager service.

This job was just an example, it happens to other jobs too.

It's just INFO logs but it does not look healthy either.

Thanks & Best
Peter

Re: Jobmanager trying to be registered for Zombie Job

Posted by Matthias Pohl <ma...@ververica.com>.

Hi Peter,
based on our analysis the issue already existed before 1.15, yes. We
couldn't come up with any other reasoning. It was just never reported... or
noticing an older ticket.

Matthias

On Mon, Apr 25, 2022 at 6:21 PM Peter Schrott <pe...@bluerootlabs.io> wrote:

> Hi Matthias,
>
> You are welcome & thanks a lot for your help too!
>
> It's not quite clear to me, the bug was already there since 1.13.6 but not
> reported yet (FLINK-27354 is a new ticket)?
>
> Best, Peter
>
>
> On Mon, Apr 25, 2022 at 5:48 PM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> Thanks again, Peter for sharing your logs. I looked into the issue with
>> the help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this
>> issue. We couldn't come up with a reason why it should have popped up just
>> now with 1.15. The bug itself is already present in 1.14. You can find more
>> details on the investigation in FLINK-27354 [1] itself.
>>
>> Best,
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>
>> On Mon, Apr 25, 2022 at 2:00 PM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> Thanks Peter, we're looking into it...
>>>
>>> On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> sorry for the late reply. It took me quite some time to get the logs
>>>> out of the system. I have attached them now.
>>>>
>>>> Its logs of 2 jobmanagers and 2 taskamangers. It can be seen on jm 1
>>>> that the job starts crashing and recovering a few times. This happens
>>>> until 2022-04-20 12:12:14,607. After that the above described behavior can
>>>> be seen.
>>>>
>>>> I hope this helps.
>>>>
>>>> Best, Peter
>>>>
>>>> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <ma...@ververica.com>
>>>> wrote:
>>>>
>>>>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to
>>>>> connect to the RM while shutting down the JobMaster.
>>>>>
>>>>> This doesn't explain your issue though, Peter. It's still unclear why
>>>>> the JobMaster is still around as stated in my previous email.
>>>>>
>>>>> Matthias
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>>>>
>>>>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <ma...@ververica.com>
>>>>> wrote:
>>>>>
>>>>>> Just by looking through the code, it appears that these logs could be
>>>>>> produced while stopping the job. The ResourceManager sends a confirmation
>>>>>> of the JobMaster being disconnected at the end back to the JobMaster. If
>>>>>> the JobMaster is still around to process the request, it would try to
>>>>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>>>>> mode already and wouldn't need to re-establish the connection). This method
>>>>>> would have been swallowed otherwise if the JobMaster was already terminated.
>>>>>>
>>>>>> The only explanation I can come up with right now (without having any
>>>>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>>>>> that it would be helpful to look at the logs to see whether there is some
>>>>>> other issue that causes the JobMaster to stop entirely.
>>>>>>
>>>>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <
>>>>>> matthias@ververica.com> wrote:
>>>>>>
>>>>>>> ...if possible it would be good to get debug rather than only info
>>>>>>> logs. Did you encounter anything odd in the TaskManager logs as well.
>>>>>>> Sharing those might be of value as well.
>>>>>>>
>>>>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <
>>>>>>> matthias@ververica.com> wrote:
>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>> thanks for sharing. That doesn't sound right. May you provide the
>>>>>>>> entire jobmanager logs?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Matthias
>>>>>>>>
>>>>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <
>>>>>>>> peter@bluerootlabs.io> wrote:
>>>>>>>>
>>>>>>>>> Hi Flink-Users,
>>>>>>>>>
>>>>>>>>> I am not sure if this does something to my cluster or not. But
>>>>>>>>> since updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>>>>
>>>>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>>>>> @akka.tcp://
>>>>>>>>> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for
>>>>>>>>> job 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>>>>
>>>>>>>>> as many times as number of parallelisms (here 10 times). These
>>>>>>>>> logs are triggered every 5 minutes.
>>>>>>>>>
>>>>>>>>> Then they are followed by:
>>>>>>>>>
>>>>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>>>>> @akka.tcp://
>>>>>>>>> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>>>>
>>>>>>>>> also 10 log entries.
>>>>>>>>>
>>>>>>>>> I followed the lifetime of the job (
>>>>>>>>> 5566648d9b1aac6c1a1b78187fd56975), it was a long-running sql
>>>>>>>>> streaming job, started on Apr 13th on a standalone cluster. After some
>>>>>>>>> recovery attempts it finally failed (using the failover strategy) on the
>>>>>>>>> 20th Apr (yesterday) for good. Then those logs started to appear. Now there
>>>>>>>>> was no other job running on my cluster anymore but the logs appeared every
>>>>>>>>> 5 minutes until I restarted this jobmanager service.
>>>>>>>>>
>>>>>>>>> This job was just an example, it happens to other jobs too.
>>>>>>>>>
>>>>>>>>> It's just INFO logs but it does not look healthy either.
>>>>>>>>>
>>>>>>>>> Thanks & Best
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>

Re: Jobmanager trying to be registered for Zombie Job

Posted by Peter Schrott <pe...@bluerootlabs.io>.

Hi Matthias,

You are welcome & thanks a lot for your help too!

It's not quite clear to me, the bug was already there since 1.13.6 but not
reported yet (FLINK-27354 is a new ticket)?

Best, Peter


On Mon, Apr 25, 2022 at 5:48 PM Matthias Pohl <ma...@ververica.com>
wrote:

> Thanks again, Peter for sharing your logs. I looked into the issue with
> the help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this
> issue. We couldn't come up with a reason why it should have popped up just
> now with 1.15. The bug itself is already present in 1.14. You can find more
> details on the investigation in FLINK-27354 [1] itself.
>
> Best,
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-27354
>
> On Mon, Apr 25, 2022 at 2:00 PM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> Thanks Peter, we're looking into it...
>>
>> On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io>
>> wrote:
>>
>>> Hi,
>>>
>>> sorry for the late reply. It took me quite some time to get the logs out
>>> of the system. I have attached them now.
>>>
>>> Its logs of 2 jobmanagers and 2 taskamangers. It can be seen on jm 1
>>> that the job starts crashing and recovering a few times. This happens
>>> until 2022-04-20 12:12:14,607. After that the above described behavior can
>>> be seen.
>>>
>>> I hope this helps.
>>>
>>> Best, Peter
>>>
>>> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <ma...@ververica.com>
>>> wrote:
>>>
>>>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to
>>>> connect to the RM while shutting down the JobMaster.
>>>>
>>>> This doesn't explain your issue though, Peter. It's still unclear why
>>>> the JobMaster is still around as stated in my previous email.
>>>>
>>>> Matthias
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>>>
>>>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <ma...@ververica.com>
>>>> wrote:
>>>>
>>>>> Just by looking through the code, it appears that these logs could be
>>>>> produced while stopping the job. The ResourceManager sends a confirmation
>>>>> of the JobMaster being disconnected at the end back to the JobMaster. If
>>>>> the JobMaster is still around to process the request, it would try to
>>>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>>>> mode already and wouldn't need to re-establish the connection). This method
>>>>> would have been swallowed otherwise if the JobMaster was already terminated.
>>>>>
>>>>> The only explanation I can come up with right now (without having any
>>>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>>>> that it would be helpful to look at the logs to see whether there is some
>>>>> other issue that causes the JobMaster to stop entirely.
>>>>>
>>>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <ma...@ververica.com>
>>>>> wrote:
>>>>>
>>>>>> ...if possible it would be good to get debug rather than only info
>>>>>> logs. Did you encounter anything odd in the TaskManager logs as well.
>>>>>> Sharing those might be of value as well.
>>>>>>
>>>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <ma...@ververica.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Peter,
>>>>>>> thanks for sharing. That doesn't sound right. May you provide the
>>>>>>> entire jobmanager logs?
>>>>>>>
>>>>>>> Best,
>>>>>>> Matthias
>>>>>>>
>>>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Flink-Users,
>>>>>>>>
>>>>>>>> I am not sure if this does something to my cluster or not. But
>>>>>>>> since updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>>>
>>>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>>>> @akka.tcp://
>>>>>>>> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
>>>>>>>> 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>>>
>>>>>>>> as many times as number of parallelisms (here 10 times). These logs
>>>>>>>> are triggered every 5 minutes.
>>>>>>>>
>>>>>>>> Then they are followed by:
>>>>>>>>
>>>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>>>> @akka.tcp://
>>>>>>>> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>>>
>>>>>>>> also 10 log entries.
>>>>>>>>
>>>>>>>> I followed the lifetime of the job (
>>>>>>>> 5566648d9b1aac6c1a1b78187fd56975), it was a long-running sql
>>>>>>>> streaming job, started on Apr 13th on a standalone cluster. After some
>>>>>>>> recovery attempts it finally failed (using the failover strategy) on the
>>>>>>>> 20th Apr (yesterday) for good. Then those logs started to appear. Now there
>>>>>>>> was no other job running on my cluster anymore but the logs appeared every
>>>>>>>> 5 minutes until I restarted this jobmanager service.
>>>>>>>>
>>>>>>>> This job was just an example, it happens to other jobs too.
>>>>>>>>
>>>>>>>> It's just INFO logs but it does not look healthy either.
>>>>>>>>
>>>>>>>> Thanks & Best
>>>>>>>> Peter
>>>>>>>>
>>>>>>>
>
> --
>
> Matthias Pohl | Engineer
>
> Follow us @VervericaData Ververica <https://www.ververica.com/>
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
> Wehner
>

Re: Jobmanager trying to be registered for Zombie Job

Posted by Matthias Pohl <ma...@ververica.com>.

Thanks again, Peter for sharing your logs. I looked into the issue with the
help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this
issue. We couldn't come up with a reason why it should have popped up just
now with 1.15. The bug itself is already present in 1.14. You can find more
details on the investigation in FLINK-27354 [1] itself.

Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-27354

On Mon, Apr 25, 2022 at 2:00 PM Matthias Pohl <ma...@ververica.com>
wrote:

> Thanks Peter, we're looking into it...
>
> On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io>
> wrote:
>
>> Hi,
>>
>> sorry for the late reply. It took me quite some time to get the logs out
>> of the system. I have attached them now.
>>
>> Its logs of 2 jobmanagers and 2 taskamangers. It can be seen on jm 1 that
>> the job starts crashing and recovering a few times. This happens
>> until 2022-04-20 12:12:14,607. After that the above described behavior can
>> be seen.
>>
>> I hope this helps.
>>
>> Best, Peter
>>
>> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect
>>> to the RM while shutting down the JobMaster.
>>>
>>> This doesn't explain your issue though, Peter. It's still unclear why
>>> the JobMaster is still around as stated in my previous email.
>>>
>>> Matthias
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>>
>>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <ma...@ververica.com>
>>> wrote:
>>>
>>>> Just by looking through the code, it appears that these logs could be
>>>> produced while stopping the job. The ResourceManager sends a confirmation
>>>> of the JobMaster being disconnected at the end back to the JobMaster. If
>>>> the JobMaster is still around to process the request, it would try to
>>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>>> mode already and wouldn't need to re-establish the connection). This method
>>>> would have been swallowed otherwise if the JobMaster was already terminated.
>>>>
>>>> The only explanation I can come up with right now (without having any
>>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>>> that it would be helpful to look at the logs to see whether there is some
>>>> other issue that causes the JobMaster to stop entirely.
>>>>
>>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <ma...@ververica.com>
>>>> wrote:
>>>>
>>>>> ...if possible it would be good to get debug rather than only info
>>>>> logs. Did you encounter anything odd in the TaskManager logs as well.
>>>>> Sharing those might be of value as well.
>>>>>
>>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <ma...@ververica.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Peter,
>>>>>> thanks for sharing. That doesn't sound right. May you provide the
>>>>>> entire jobmanager logs?
>>>>>>
>>>>>> Best,
>>>>>> Matthias
>>>>>>
>>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Flink-Users,
>>>>>>>
>>>>>>> I am not sure if this does something to my cluster or not. But since
>>>>>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>>
>>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>>> @akka.tcp://
>>>>>>> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
>>>>>>> 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>>
>>>>>>> as many times as number of parallelisms (here 10 times). These logs
>>>>>>> are triggered every 5 minutes.
>>>>>>>
>>>>>>> Then they are followed by:
>>>>>>>
>>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>>> @akka.tcp://
>>>>>>> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>>
>>>>>>> also 10 log entries.
>>>>>>>
>>>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>>>>>> it was a long-running sql streaming job, started on Apr 13th on a
>>>>>>> standalone cluster. After some recovery attempts it finally failed (using
>>>>>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>>>>>> logs started to appear. Now there was no other job running on my cluster
>>>>>>> anymore but the logs appeared every 5 minutes until I restarted this
>>>>>>> jobmanager service.
>>>>>>>
>>>>>>> This job was just an example, it happens to other jobs too.
>>>>>>>
>>>>>>> It's just INFO logs but it does not look healthy either.
>>>>>>>
>>>>>>> Thanks & Best
>>>>>>> Peter
>>>>>>>
>>>>>>

-- 

Matthias Pohl | Engineer

Follow us @VervericaData Ververica <https://www.ververica.com/>

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
Wehner

Re: Jobmanager trying to be registered for Zombie Job

Posted by Matthias Pohl <ma...@ververica.com>.

Thanks Peter, we're looking into it...

On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io>
wrote:

> Hi,
>
> sorry for the late reply. It took me quite some time to get the logs out
> of the system. I have attached them now.
>
> Its logs of 2 jobmanagers and 2 taskamangers. It can be seen on jm 1 that
> the job starts crashing and recovering a few times. This happens
> until 2022-04-20 12:12:14,607. After that the above described behavior can
> be seen.
>
> I hope this helps.
>
> Best, Peter
>
> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect
>> to the RM while shutting down the JobMaster.
>>
>> This doesn't explain your issue though, Peter. It's still unclear why the
>> JobMaster is still around as stated in my previous email.
>>
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>
>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> Just by looking through the code, it appears that these logs could be
>>> produced while stopping the job. The ResourceManager sends a confirmation
>>> of the JobMaster being disconnected at the end back to the JobMaster. If
>>> the JobMaster is still around to process the request, it would try to
>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>> mode already and wouldn't need to re-establish the connection). This method
>>> would have been swallowed otherwise if the JobMaster was already terminated.
>>>
>>> The only explanation I can come up with right now (without having any
>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>> that it would be helpful to look at the logs to see whether there is some
>>> other issue that causes the JobMaster to stop entirely.
>>>
>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <ma...@ververica.com>
>>> wrote:
>>>
>>>> ...if possible it would be good to get debug rather than only info
>>>> logs. Did you encounter anything odd in the TaskManager logs as well.
>>>> Sharing those might be of value as well.
>>>>
>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <ma...@ververica.com>
>>>> wrote:
>>>>
>>>>> Hi Peter,
>>>>> thanks for sharing. That doesn't sound right. May you provide the
>>>>> entire jobmanager logs?
>>>>>
>>>>> Best,
>>>>> Matthias
>>>>>
>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>>>>> wrote:
>>>>>
>>>>>> Hi Flink-Users,
>>>>>>
>>>>>> I am not sure if this does something to my cluster or not. But since
>>>>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>
>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>> @akka.tcp://
>>>>>> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
>>>>>> 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>
>>>>>> as many times as number of parallelisms (here 10 times). These logs
>>>>>> are triggered every 5 minutes.
>>>>>>
>>>>>> Then they are followed by:
>>>>>>
>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>> @akka.tcp://
>>>>>> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>
>>>>>> also 10 log entries.
>>>>>>
>>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>>>>> it was a long-running sql streaming job, started on Apr 13th on a
>>>>>> standalone cluster. After some recovery attempts it finally failed (using
>>>>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>>>>> logs started to appear. Now there was no other job running on my cluster
>>>>>> anymore but the logs appeared every 5 minutes until I restarted this
>>>>>> jobmanager service.
>>>>>>
>>>>>> This job was just an example, it happens to other jobs too.
>>>>>>
>>>>>> It's just INFO logs but it does not look healthy either.
>>>>>>
>>>>>> Thanks & Best
>>>>>> Peter
>>>>>>
>>>>>

Re: Jobmanager trying to be registered for Zombie Job

Posted by Peter Schrott <pe...@bluerootlabs.io>.

Hi,

sorry for the late reply. It took me quite some time to get the logs out of
the system. I have attached them now.

Its logs of 2 jobmanagers and 2 taskamangers. It can be seen on jm 1 that
the job starts crashing and recovering a few times. This happens
until 2022-04-20 12:12:14,607. After that the above described behavior can
be seen.

I hope this helps.

Best, Peter

On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <ma...@ververica.com>
wrote:

> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect
> to the RM while shutting down the JobMaster.
>
> This doesn't explain your issue though, Peter. It's still unclear why the
> JobMaster is still around as stated in my previous email.
>
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-27354
>
> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> Just by looking through the code, it appears that these logs could be
>> produced while stopping the job. The ResourceManager sends a confirmation
>> of the JobMaster being disconnected at the end back to the JobMaster. If
>> the JobMaster is still around to process the request, it would try to
>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>> mode already and wouldn't need to re-establish the connection). This method
>> would have been swallowed otherwise if the JobMaster was already terminated.
>>
>> The only explanation I can come up with right now (without having any
>> logs) is that stopping the JobMaster didn't finish for some reason. For
>> that it would be helpful to look at the logs to see whether there is some
>> other issue that causes the JobMaster to stop entirely.
>>
>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> ...if possible it would be good to get debug rather than only info logs.
>>> Did you encounter anything odd in the TaskManager logs as well. Sharing
>>> those might be of value as well.
>>>
>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <ma...@ververica.com>
>>> wrote:
>>>
>>>> Hi Peter,
>>>> thanks for sharing. That doesn't sound right. May you provide the
>>>> entire jobmanager logs?
>>>>
>>>> Best,
>>>> Matthias
>>>>
>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>>>> wrote:
>>>>
>>>>> Hi Flink-Users,
>>>>>
>>>>> I am not sure if this does something to my cluster or not. But since
>>>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>
>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>>>> for job 5566648d9b1aac6c1a1b78187fd56975.
>>>>>
>>>>> as many times as number of parallelisms (here 10 times). These logs
>>>>> are triggered every 5 minutes.
>>>>>
>>>>> Then they are followed by:
>>>>>
>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>>>> failed.
>>>>>
>>>>> also 10 log entries.
>>>>>
>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>>>> it was a long-running sql streaming job, started on Apr 13th on a
>>>>> standalone cluster. After some recovery attempts it finally failed (using
>>>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>>>> logs started to appear. Now there was no other job running on my cluster
>>>>> anymore but the logs appeared every 5 minutes until I restarted this
>>>>> jobmanager service.
>>>>>
>>>>> This job was just an example, it happens to other jobs too.
>>>>>
>>>>> It's just INFO logs but it does not look healthy either.
>>>>>
>>>>> Thanks & Best
>>>>> Peter
>>>>>
>>>>

Re: Jobmanager trying to be registered for Zombie Job

Posted by Matthias Pohl <ma...@ververica.com>.

FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect to
the RM while shutting down the JobMaster.

This doesn't explain your issue though, Peter. It's still unclear why the
JobMaster is still around as stated in my previous email.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-27354

On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <ma...@ververica.com>
wrote:

> Just by looking through the code, it appears that these logs could be
> produced while stopping the job. The ResourceManager sends a confirmation
> of the JobMaster being disconnected at the end back to the JobMaster. If
> the JobMaster is still around to process the request, it would try to
> reconnect (I'd consider that a bug because the JobMaster is in shutdown
> mode already and wouldn't need to re-establish the connection). This method
> would have been swallowed otherwise if the JobMaster was already terminated.
>
> The only explanation I can come up with right now (without having any
> logs) is that stopping the JobMaster didn't finish for some reason. For
> that it would be helpful to look at the logs to see whether there is some
> other issue that causes the JobMaster to stop entirely.
>
> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> ...if possible it would be good to get debug rather than only info logs.
>> Did you encounter anything odd in the TaskManager logs as well. Sharing
>> those might be of value as well.
>>
>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> Hi Peter,
>>> thanks for sharing. That doesn't sound right. May you provide the entire
>>> jobmanager logs?
>>>
>>> Best,
>>> Matthias
>>>
>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>>> wrote:
>>>
>>>> Hi Flink-Users,
>>>>
>>>> I am not sure if this does something to my cluster or not. But since
>>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>
>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>>> for job 5566648d9b1aac6c1a1b78187fd56975.
>>>>
>>>> as many times as number of parallelisms (here 10 times). These logs are
>>>> triggered every 5 minutes.
>>>>
>>>> Then they are followed by:
>>>>
>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>>> failed.
>>>>
>>>> also 10 log entries.
>>>>
>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>>> it was a long-running sql streaming job, started on Apr 13th on a
>>>> standalone cluster. After some recovery attempts it finally failed (using
>>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>>> logs started to appear. Now there was no other job running on my cluster
>>>> anymore but the logs appeared every 5 minutes until I restarted this
>>>> jobmanager service.
>>>>
>>>> This job was just an example, it happens to other jobs too.
>>>>
>>>> It's just INFO logs but it does not look healthy either.
>>>>
>>>> Thanks & Best
>>>> Peter
>>>>
>>>

Re: Jobmanager trying to be registered for Zombie Job

Posted by Matthias Pohl <ma...@ververica.com>.

Just by looking through the code, it appears that these logs could be
produced while stopping the job. The ResourceManager sends a confirmation
of the JobMaster being disconnected at the end back to the JobMaster. If
the JobMaster is still around to process the request, it would try to
reconnect (I'd consider that a bug because the JobMaster is in shutdown
mode already and wouldn't need to re-establish the connection). This method
would have been swallowed otherwise if the JobMaster was already terminated.

The only explanation I can come up with right now (without having any logs)
is that stopping the JobMaster didn't finish for some reason. For that it
would be helpful to look at the logs to see whether there is some other
issue that causes the JobMaster to stop entirely.

On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <ma...@ververica.com>
wrote:

> ...if possible it would be good to get debug rather than only info logs.
> Did you encounter anything odd in the TaskManager logs as well. Sharing
> those might be of value as well.
>
> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> Hi Peter,
>> thanks for sharing. That doesn't sound right. May you provide the entire
>> jobmanager logs?
>>
>> Best,
>> Matthias
>>
>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>> wrote:
>>
>>> Hi Flink-Users,
>>>
>>> I am not sure if this does something to my cluster or not. But since
>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>
>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>> for job 5566648d9b1aac6c1a1b78187fd56975.
>>>
>>> as many times as number of parallelisms (here 10 times). These logs are
>>> triggered every 5 minutes.
>>>
>>> Then they are followed by:
>>>
>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>> failed.
>>>
>>> also 10 log entries.
>>>
>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>> it was a long-running sql streaming job, started on Apr 13th on a
>>> standalone cluster. After some recovery attempts it finally failed (using
>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>> logs started to appear. Now there was no other job running on my cluster
>>> anymore but the logs appeared every 5 minutes until I restarted this
>>> jobmanager service.
>>>
>>> This job was just an example, it happens to other jobs too.
>>>
>>> It's just INFO logs but it does not look healthy either.
>>>
>>> Thanks & Best
>>> Peter
>>>
>>

Re: Jobmanager trying to be registered for Zombie Job

Posted by Matthias Pohl <ma...@ververica.com>.

...if possible it would be good to get debug rather than only info logs.
Did you encounter anything odd in the TaskManager logs as well. Sharing
those might be of value as well.

On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <ma...@ververica.com>
wrote:

> Hi Peter,
> thanks for sharing. That doesn't sound right. May you provide the entire
> jobmanager logs?
>
> Best,
> Matthias
>
> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
> wrote:
>
>> Hi Flink-Users,
>>
>> I am not sure if this does something to my cluster or not. But since
>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>
>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>> for job 5566648d9b1aac6c1a1b78187fd56975.
>>
>> as many times as number of parallelisms (here 10 times). These logs are
>> triggered every 5 minutes.
>>
>> Then they are followed by:
>>
>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>> failed.
>>
>> also 10 log entries.
>>
>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>> it was a long-running sql streaming job, started on Apr 13th on a
>> standalone cluster. After some recovery attempts it finally failed (using
>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>> logs started to appear. Now there was no other job running on my cluster
>> anymore but the logs appeared every 5 minutes until I restarted this
>> jobmanager service.
>>
>> This job was just an example, it happens to other jobs too.
>>
>> It's just INFO logs but it does not look healthy either.
>>
>> Thanks & Best
>> Peter
>>
>

Re: Jobmanager trying to be registered for Zombie Job

Posted by Matthias Pohl <ma...@ververica.com>.

Hi Peter,
thanks for sharing. That doesn't sound right. May you provide the entire
jobmanager logs?

Best,
Matthias

On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io> wrote:

> Hi Flink-Users,
>
> I am not sure if this does something to my cluster or not. But since
> updating to Flink 1.15 (atm rc4) I find the following logs:
>
> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762@akka.tcp://
> flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
> 5566648d9b1aac6c1a1b78187fd56975.
>
> as many times as number of parallelisms (here 10 times). These logs are
> triggered every 5 minutes.
>
> Then they are followed by:
>
> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
> @akka.tcp://flink@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
> failed.
>
> also 10 log entries.
>
> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975), it
> was a long-running sql streaming job, started on Apr 13th on a standalone
> cluster. After some recovery attempts it finally failed (using the failover
> strategy) on the 20th Apr (yesterday) for good. Then those logs started to
> appear. Now there was no other job running on my cluster anymore but the
> logs appeared every 5 minutes until I restarted this jobmanager service.
>
> This job was just an example, it happens to other jobs too.
>
> It's just INFO logs but it does not look healthy either.
>
> Thanks & Best
> Peter
>