Posted to user@flink.apache.org by Puneet Duggal <pu...@gmail.com> on 2022/10/11 22:24:56 UTC

Job Manager getting restarted while restarting task manager

Hi,

I am facing an issue where restarting a task manager after some configuration changes also causes the leader job manager to restart, even though the task manager itself restarts successfully with the updated configuration. The leader job manager logs are pasted below.


2022-10-11 22:11:02,207 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2022-10-11 22:11:02,411 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /<TM-IP>:35376
2022-10-11 22:11:02,413 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@<TM-IP>:35376]] Caused by: [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
2022-10-11 22:11:02,682 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /<TM-IP>:35376
2022-10-11 22:11:02,683 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@<TM-IP>:35376]] Caused by: [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
2022-10-11 22:11:12,702 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /<TM-IP>:35376
2022-10-11 22:11:12,703 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@<TM-IP>:35376]] Caused by: [java.net.ConnectException: Connection refused: /<TM-IP>:35376]
2022-10-11 22:11:21,683 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-10-11 22:11:21,687 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:33887


Regards,
Puneet

Re: Job Manager getting restarted while restarting task manager

Posted by yu'an huang <h....@gmail.com>.

Are you able to reproduce this scenario? Did you accidentally send a kill
signal to the job manager process?

On Thu, 13 Oct 2022 at 4:02 PM, Puneet Duggal <pu...@gmail.com>
wrote:

> Hi,
>
> We use session deployment mode with an HA setup. Currently we have 3 job
> managers and 3 task managers running Flink version 1.12.1. Please find
> attached the complete job manager logs.

Re: Job Manager getting restarted while restarting task manager

Posted by Puneet Duggal <pu...@gmail.com>.

Hi,

We use session deployment mode with an HA setup. Currently we have 3 job managers and 3 task managers running Flink version 1.12.1. Please find attached the complete job manager logs.

> On 13-Oct-2022, at 7:28 AM, Xintong Song <to...@gmail.com> wrote:
> 
> I meant your jobmanager also received a SIGTERM signal, and you would need to figure out where it comes from.
> 
> To be specific, this line of log:
> 2022-10-11 22:11:21,683 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 
> I believe this is from the jobmanager log, as `ClusterEntrypoint` is a class used by jobmanager only.
> 
> Best,
> Xintong

Re: Job Manager getting restarted while restarting task manager

Posted by Xintong Song <to...@gmail.com>.

I meant your jobmanager also received a SIGTERM signal, and you would need
to figure out where it comes from.

To be specific, this line of log:

> 2022-10-11 22:11:21,683 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED
> SIGNAL 15: SIGTERM. Shutting down as requested.
>

I believe this is from the jobmanager log, as `ClusterEntrypoint` is a
class used by jobmanager only.

Best,

Xintong
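
For readers who want to locate such a line in a longer jobmanager log, here is a minimal Python sketch. It assumes the log layout seen in the excerpts quoted in this thread (timestamp, level, logger, `[] -`, message); the function name `find_signals` and the regular expressions are illustrative and not part of Flink.

```python
import re

# Layout assumed from the log excerpts in this thread:
# "<date> <time>,<millis> <LEVEL>  <logger>  [] - <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+\[\] - (?P<msg>.*)$"
)
# Message format taken from the ClusterEntrypoint line quoted above.
SIGNAL = re.compile(r"RECEIVED SIGNAL (?P<num>\d+): (?P<name>SIG[A-Z0-9]+)")


def find_signals(lines):
    """Return (timestamp, logger, signal name) for every signal log line."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        s = SIGNAL.search(m.group("msg"))
        if s:
            hits.append((m.group("ts"), m.group("logger"), s.group("name")))
    return hits


# Two lines copied from the log excerpt in the original post.
sample = [
    "2022-10-11 22:11:02,207 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@<TM-IP>:35376] has failed, address is now gated for [50] ms. Reason: [Disassociated]",
    "2022-10-11 22:11:21,683 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.",
]

for ts, logger, name in find_signals(sample):
    print(ts, logger, name)
```

The timestamp of a hit can then be compared against shell history or scheduler events on the same host to narrow down who sent the signal.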



On Thu, Oct 13, 2022 at 9:06 AM yu'an huang <h....@gmail.com> wrote:

> Hi,
>
> Which deployment mode do you use? What is the Flink version?
> I don't think killing TaskManagers would make the JobManager restart. Could
> you provide the whole log as an attachment so we can investigate?

Re: Job Manager getting restarted while restarting task manager

Posted by yu'an huang <h....@gmail.com>.

Hi,

Which deployment mode do you use? What is the Flink version?
I don't think killing TaskManagers would make the JobManager restart. Could
you provide the whole log as an attachment so we can investigate?

On Wed, 12 Oct 2022 at 6:01 PM, Puneet Duggal <pu...@gmail.com>
wrote:

> Hi Xintong Song,
>
> Thanks for your immediate reply. Yes, I restart the task manager via the
> kill command and then a Flink restart, because I have seen cases where a
> plain Flink restart does not pick up the latest configuration. What confuses
> me is why killing the task manager process and then restarting it causes the
> job manager to stop and restart.
>
> Regards,
> Puneet

Re: Job Manager getting restarted while restarting task manager

Posted by Puneet Duggal <pu...@gmail.com>.

Hi Xintong Song,

Thanks for your immediate reply. Yes, I restart the task manager via the kill command and then a Flink restart, because I have seen cases where a plain Flink restart does not pick up the latest configuration. What confuses me is why killing the task manager process and then restarting it causes the job manager to stop and restart.

Regards,
Puneet

> On 12-Oct-2022, at 7:33 AM, Xintong Song <to...@gmail.com> wrote:
> 
> The log shows that the jobmanager received a SIGTERM signal from an external source. Depending on how you deploy Flink, that could be a 'kill <PID>' command, a Kubernetes pod removal or eviction, etc. You may want to check where the signal came from.
> 
> Best,
> Xintong

Re: Job Manager getting restarted while restarting task manager

Posted by Xintong Song <to...@gmail.com>.

The log shows that the jobmanager received a SIGTERM signal from an external
source. Depending on how you deploy Flink, that could be a 'kill <PID>'
command, a Kubernetes pod removal or eviction, etc. You may want to check
where the signal came from.

Best,

Xintong
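
The thread does not establish where the signal came from. One possibility worth ruling out, and it is only an assumption here, is a kill command that selects processes by a broad pattern and therefore also hits a jobmanager running on the same host. The Python sketch below illustrates the difference between a broad and a narrow match. The PIDs, paths, and command lines are made-up samples, `pids_matching` is a hypothetical helper, and the two main class names are assumed to be the ones the standalone scripts launch.

```python
# Main classes assumed for a standalone session cluster (not confirmed in the thread).
JOB_MANAGER_MAIN = "org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint"
TASK_MANAGER_MAIN = "org.apache.flink.runtime.taskexecutor.TaskManagerRunner"


def pids_matching(processes, pattern):
    """Return the PIDs whose command line contains the given substring."""
    return [pid for pid, cmdline in processes if pattern in cmdline]


# Hypothetical process table for a host that runs both roles.
processes = [
    (4101, "java -Xmx1g -classpath /opt/flink/lib/* " + JOB_MANAGER_MAIN
           + " --configDir /opt/flink/conf"),
    (4215, "java -Xmx4g -classpath /opt/flink/lib/* " + TASK_MANAGER_MAIN
           + " --configDir /opt/flink/conf"),
]

# A broad pattern selects both JVMs, so the jobmanager would be signalled too.
print(pids_matching(processes, "flink"))            # [4101, 4215]

# Matching on the taskmanager main class selects only the taskmanager.
print(pids_matching(processes, TASK_MANAGER_MAIN))  # [4215]
```

If the restart procedure does select processes by pattern, checking which PIDs it resolves to before sending the signal would confirm or rule out this explanation.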


