Posted to user@flink.apache.org by John Smith <ja...@gmail.com> on 2019/10/03 15:01:38 UTC

Warnings connecting to Akka

Hi, I'm running Flink 1.8. The cluster seems to be OK, but I see these
warnings in the logs...

2019-10-03 14:57:25,152 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /xxx.xxx.xxx.65:46167
2019-10-03 14:57:25,156 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@xxx.xxx.xxx.65:46167] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@xxx.xxx.xxx.65:46167]] Caused by: [Connection refused: /xxx.xxx.xxx.65:46167]

Re: Warnings connecting to Akka

Posted by John Smith <ja...@gmail.com>.
Oh that's fine. I was just wondering why it happened. It seems to have gone
away since the reboot.

On Fri, 18 Oct 2019 at 10:43, Till Rohrmann <tr...@apache.org> wrote:

> Hi John,
>
> The reason you are seeing these warnings is that Akka tries to
> re-establish the connection to a lost endpoint (here a dead TaskExecutor).
> This should continue until either the connection is quarantined or the
> underlying ActorRef to the remote endpoint has been garbage collected. The
> former should not really happen, and the latter should happen once Flink
> has realized that the TaskExecutor has died. Flink uses its own heartbeats
> to detect this. Depending on the configuration (the default heartbeat
> timeout is 50s), this can take a bit. However, the warnings should
> eventually stop appearing.
>
> I admit that this is not ideal in a scenario where TaskExecutors die
> regularly, but it helps to debug problematic scenarios. One way to suppress
> these statements is to set the logger for akka.remote to ERROR. But then
> one would not see when Akka has lost the connection and tries to reconnect.
>
> Cheers,
> Till
>
> On Thu, Oct 10, 2019 at 5:31 PM John Smith <ja...@gmail.com> wrote:
>
>> OK, so it seems there was some sort of network issue, followed by a
>> leader election. But it seems it had some old state and kept trying to
>> connect to the same task machine over and over...?
>>
>> 2019-09-19 22:26:14,841 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Unable to read additional data from server sessionid 0xXXXXXX, likely
>> server has closed socket, closing socket connection and attempting reconnect
>> 2019-09-19 22:26:14,946 INFO
>>  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>  - State change: SUSPENDED
>> 2019-09-19 22:26:14,947 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper suspended. The contender http://XXXXXX-2:8081 no
>> longer participates in the leader election.
>> 2019-09-19 22:26:14,947 WARN
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper suspended. Can no longer retrieve the leader
>> from ZooKeeper.
>> 2019-09-19 22:26:14,947 WARN
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper suspended. Can no longer retrieve the leader
>> from ZooKeeper.
>> 2019-09-19 22:26:14,948 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper suspended. The contender akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager
>> no longer participates in the leader election.
>> 2019-09-19 22:26:14,948 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper suspended. The contender akka.tcp://flink@fXXXXXX-2:37697/user/dispatcher
>> no longer participates in the leader election.
>> 2019-09-19 22:26:14,949 WARN
>>  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are
>> not monitored (temporarily).
>> 2019-09-19 22:26:25,185 WARN
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>> configuration section named 'Client' was found in specified JAAS
>> configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue
>> connection to Zookeeper server without SASL authentication, if Zookeeper
>> server allows it.
>> 2019-09-19 22:26:25,186 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Opening socket connection to server XXXXXX.71/XXXXXX.71:2181
>> 2019-09-19 22:26:25,186 ERROR
>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>> Authentication failed
>> 2019-09-19 22:26:25,192 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Socket connection established to XXXXXX.71/XXXXXX.71:2181, initiating
>> session
>> 2019-09-19 22:26:25,199 WARN
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has
>> expired
>> 2019-09-19 22:26:25,199 INFO
>>  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>  - State change: LOST
>> 2019-09-19 22:26:25,199 WARN
>>  org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>> Session expired event received
>> 2019-09-19 22:26:25,199 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper lost. The contender http://XXXXXX-2:8081 no
>> longer participates in the leader election.
>> 2019-09-19 22:26:25,199 WARN
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper lost. Can no longer retrieve the leader from
>> ZooKeeper.
>> 2019-09-19 22:26:25,200 WARN
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper lost. Can no longer retrieve the leader from
>> ZooKeeper.
>> 2019-09-19 22:26:25,199 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
>> Initiating client connection,
>> connectString=XXXXXX-1.XXXXXX:2181,XXXXXX-2.XXXXXX:2181,XXXXXX-3.XXXXXX:2181
>> sessionTimeout=60000
>> watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@2bec854f
>> 2019-09-19 22:26:25,200 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper lost. The contender akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager
>> no longer participates in the leader election.
>> 2019-09-19 22:26:25,200 WARN
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper lost. The contender akka.tcp://flink@XXXXXX-2:37697/user/dispatcher
>> no longer participates in the leader election.
>> 2019-09-19 22:26:25,201 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has
>> expired, closing socket connection
>> 2019-09-19 22:26:25,201 WARN
>>  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> ZooKeeper connection LOST. Changes to the submitted job graphs are not
>> monitored (permanently).
>> 2019-09-19 22:26:25,220 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> EventThread shut down for session: 0x3017fc1a6660000
>> 2019-09-19 22:26:25,231 WARN
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
>> configuration failed: javax.security.auth.login.LoginException: No JAAS
>> configuration section named 'Client' was found in specified JAAS
>> configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue
>> connection to Zookeeper server without SASL authentication, if Zookeeper
>> server allows it.
>> 2019-09-19 22:26:25,232 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Opening socket connection to server XXXXXX.33/XXXXXX.33:2181
>> 2019-09-19 22:26:25,232 ERROR
>> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
>> Authentication failed
>> 2019-09-19 22:26:25,233 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Socket connection established to XXXXXX.33/XXXXXX.33:2181, initiating
>> session
>> 2019-09-19 22:26:25,247 INFO
>>  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> Session establishment complete on server XXXXXX.33/XXXXXX.33:2181,
>> sessionid = 0x301db1787060000, negotiated timeout = 40000
>> 2019-09-19 22:26:25,247 INFO
>>  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>>  - State change: RECONNECTED
>> 2019-09-19 22:26:25,248 INFO
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper was reconnected. Leader retrieval can be
>> restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>  - Connection to ZooKeeper was reconnected. Leader retrieval can be
>> restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Connection to ZooKeeper was reconnected. Leader election can be restarted.
>> 2019-09-19 22:26:25,253 INFO
>>  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
>> monitored again.
>> 2019-09-19 22:26:34,376 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Disassociated]
>> 2019-09-19 22:26:34,376 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink-metrics@XXXXXX.11:38091] has failed, address is now
>> gated for [50] ms. Reason: [Disassociated]
>> 2019-09-19 22:26:35,147 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:26:35,149 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:26:45,167 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:26:45,168 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:26:55,151 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:26:55,153 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:27:05,159 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:27:05,160 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:27:15,157 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:27:15,161 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:27:25,152 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:27:25,160 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>> 2019-09-19 22:27:35,161 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: /XXXXXX.11:46167
>> 2019-09-19 22:27:35,165 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system
>> [akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
>> [50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
>> Caused by: [Connection refused: /XXXXXX.11:46167]
>>
>>
>>
>> On Wed, 9 Oct 2019 at 19:44, Timothy Victor <vi...@gmail.com> wrote:
>>
>>> We see a very similar (if not the same) error running version 1.9 on
>>> Kubernetes. So far, what we have discovered is that a taskmanager gets
>>> killed and a new one is created, but the JM still thinks it needs to
>>> connect to the old (now dead) TM. I was even able to see a taskmanager on
>>> the same host and port but with different TM instance IDs in the Flink UI.
>>> The issue seems to be persistent (i.e. it doesn't clear after a few minutes).
>>>
>>> FWIW, the TM was dying due to the liveness probe in K8s. We have increased
>>> that, but the above issue is still a concern.
>>>
>>> Any ideas?
>>>
>>> Tim
>>>
>>> On Wed, Oct 9, 2019, 3:15 PM John Smith <ja...@gmail.com> wrote:
>>>
>>>> Sorry, I've been away on leave. I'll check ASAP.
>>>>
>>>> On Thu, 3 Oct 2019 at 20:52, Zili Chen <wa...@gmail.com> wrote:
>>>>
>>>>> Does the log you attached above come from a TaskManager node? If so,
>>>>> in what state is the Job node it tried to connect to? Did it crash?
>>>>>
>>>>> BTW, it would be helpful if you could attach more of the TM and JM logs,
>>>>> beyond the two lines reporting the Akka connection refused.
>>>>>
>>>>>
>>>>> John Smith <ja...@gmail.com> wrote on Fri, Oct 4, 2019 at 2:08 AM:
>>>>>
>>>>>> So I guess it had some older state?
>>>>>>
>>>>>> On Thu., Oct. 3, 2019, 11:29 a.m. John Smith, <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm running a standalone cluster with ZooKeeper. It seems it was
>>>>>>> trying to connect to an older node. I rebooted the Job node that was
>>>>>>> complaining. It seems to be OK now...
>>>>>>>
>>>>>>> I have 3 ZooKeeper nodes, 3 Job nodes and 3 Task nodes.
>>>>>>>
>>>>>>> On Thu, 3 Oct 2019 at 11:15, Zili Chen <wa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi John,
>>>>>>>>
>>>>>>>> could you provide some details, such as which mode you run in
>>>>>>>> (standalone/YARN) and the related configuration (jobmanager.address,
>>>>>>>> jobmanager.port and so on)?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> tison.
>>>>>>>>
>>>>>>>>
>>>>>>>> John Smith <ja...@gmail.com> wrote on Thu, Oct 3, 2019 at 11:02 PM:
>>>>>>>>
>>>>>>>>> Hi, I'm running Flink 1.8. The cluster seems to be OK, but I see
>>>>>>>>> these warnings in the logs...
>>>>>>>>>
>>>>>>>>> 2019-10-03 14:57:25,152 WARN
>>>>>>>>>  akka.remote.transport.netty.NettyTransport                    - Remote
>>>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>>>> refused: /xxx.xxx.xxx.65:46167
>>>>>>>>> 2019-10-03 14:57:25,156 WARN
>>>>>>>>>  akka.remote.ReliableDeliverySupervisor                        -
>>>>>>>>> Association with remote system [akka.tcp://flink@xxx.xxx.xxx.65:46167]
>>>>>>>>> has failed, address is now gated for [50] ms. Reason: [Association failed
>>>>>>>>> with [akka.tcp://flink@xxx.xxx.xxx.65:46167]] Caused by:
>>>>>>>>> [Connection refused: /xxx.xxx.xxx.65:46167]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>

Re: Warnings connecting to Akka

Posted by Till Rohrmann <tr...@apache.org>.
Hi John,

The reason you are seeing these warnings is that Akka tries to
re-establish the connection to a lost endpoint (here a dead TaskExecutor).
This should continue until either the connection is quarantined or the
underlying ActorRef to the remote endpoint has been garbage collected. The
former should not really happen, and the latter should happen once Flink
has realized that the TaskExecutor has died. Flink uses its own heartbeats
to detect this. Depending on the configuration (the default heartbeat
timeout is 50s), this can take a bit. However, the warnings should
eventually stop appearing.
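
For reference, the heartbeat settings mentioned above live in
flink-conf.yaml. The following is only a sketch, assuming the Flink 1.8 key
names heartbeat.interval and heartbeat.timeout (both in milliseconds) and
their documented defaults; verify them against the configuration reference
for the version in use.

```yaml
# flink-conf.yaml (sketch; values shown are the assumed defaults)
heartbeat.interval: 10000   # ms between heartbeat requests
heartbeat.timeout: 50000    # ms without a heartbeat before the peer is considered dead
```

A lower heartbeat.timeout should let Flink notice a dead TaskExecutor sooner,
at the cost of more false positives on an unreliable network.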

I admit that this is not ideal in a scenario where TaskExecutors die
regularly, but it helps to debug problematic scenarios. One way to suppress
these statements is to set the logger for akka.remote to ERROR. But then
one would not see when Akka has lost the connection and tries to reconnect.
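
A minimal sketch of that logger change, assuming the stock
conf/log4j.properties (log4j 1.x syntax) shipped with Flink 1.8 is in use; a
logback-based setup would need the equivalent logger entry in its own
configuration file instead.

```properties
# conf/log4j.properties (sketch)
# Log only ERROR and above from Akka remoting. This hides the
# "Remote connection to [null] failed" and
# "Association with remote system ... has failed" WARN lines.
log4j.logger.akka.remote=ERROR
```

The logger name covers both akka.remote.ReliableDeliverySupervisor and
akka.remote.transport.netty.NettyTransport, since log4j loggers inherit
levels by name prefix. The processes typically need a restart to pick up
the change.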

Cheers,
Till

Re: Warnings connecting to Akka

Posted by John Smith <ja...@gmail.com>.
OK, so it seems there was some sort of network issue, followed by a leader
election. But the Job node seems to have held on to some old state and kept
trying to connect to the same task machine over and over...?

2019-09-19 22:26:14,841 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Unable to read additional data from server sessionid 0xXXXXXX, likely
server has closed socket, closing socket connection and attempting reconnect
2019-09-19 22:26:14,946 INFO
 org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
 - State change: SUSPENDED
2019-09-19 22:26:14,947 WARN
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper suspended. The contender http://XXXXXX-2:8081 no
longer participates in the leader election.
2019-09-19 22:26:14,947 WARN
 org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Connection to ZooKeeper suspended. Can no longer retrieve the leader
from ZooKeeper.
2019-09-19 22:26:14,947 WARN
 org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Connection to ZooKeeper suspended. Can no longer retrieve the leader
from ZooKeeper.
2019-09-19 22:26:14,948 WARN
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper suspended. The contender
akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager
no longer participates in the leader election.
2019-09-19 22:26:14,948 WARN
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper suspended. The contender
akka.tcp://flink@fXXXXXX-2:37697/user/dispatcher
no longer participates in the leader election.
2019-09-19 22:26:14,949 WARN
 org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are
not monitored (temporarily).
2019-09-19 22:26:25,185 WARN
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
2019-09-19 22:26:25,186 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Opening socket connection to server XXXXXX.71/XXXXXX.71:2181
2019-09-19 22:26:25,186 ERROR
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
2019-09-19 22:26:25,192 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Socket connection established to XXXXXX.71/XXXXXX.71:2181, initiating
session
2019-09-19 22:26:25,199 WARN
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has
expired
2019-09-19 22:26:25,199 INFO
 org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
 - State change: LOST
2019-09-19 22:26:25,199 WARN
 org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Session expired event received
2019-09-19 22:26:25,199 WARN
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper lost. The contender http://XXXXXX-2:8081 no longer
participates in the leader election.
2019-09-19 22:26:25,199 WARN
 org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Connection to ZooKeeper lost. Can no longer retrieve the leader from
ZooKeeper.
2019-09-19 22:26:25,200 WARN
 org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Connection to ZooKeeper lost. Can no longer retrieve the leader from
ZooKeeper.
2019-09-19 22:26:25,199 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
Initiating client connection,
connectString=XXXXXX-1.XXXXXX:2181,XXXXXX-2.XXXXXX:2181,XXXXXX-3.XXXXXX:2181
sessionTimeout=60000
watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@2bec854f
2019-09-19 22:26:25,200 WARN
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper lost. The contender
akka.tcp://flink@XXXXXX-2:37697/user/resourcemanager
no longer participates in the leader election.
2019-09-19 22:26:25,200 WARN
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper lost. The contender
akka.tcp://flink@XXXXXX-2:37697/user/dispatcher
no longer participates in the leader election.
2019-09-19 22:26:25,201 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Unable to reconnect to ZooKeeper service, session 0x3017fc1a6660000 has
expired, closing socket connection
2019-09-19 22:26:25,201 WARN
 org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
ZooKeeper connection LOST. Changes to the submitted job graphs are not
monitored (permanently).
2019-09-19 22:26:25,220 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
EventThread shut down for session: 0x3017fc1a6660000
2019-09-19 22:26:25,231 WARN
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - SASL
configuration failed: javax.security.auth.login.LoginException: No JAAS
configuration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-2423287132287811787.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
2019-09-19 22:26:25,232 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Opening socket connection to server XXXXXX.33/XXXXXX.33:2181
2019-09-19 22:26:25,232 ERROR
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  -
Authentication failed
2019-09-19 22:26:25,233 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Socket connection established to XXXXXX.33/XXXXXX.33:2181, initiating
session
2019-09-19 22:26:25,247 INFO
 org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
Session establishment complete on server XXXXXX.33/XXXXXX.33:2181,
sessionid = 0x301db1787060000, negotiated timeout = 40000
2019-09-19 22:26:25,247 INFO
 org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
 - State change: RECONNECTED
2019-09-19 22:26:25,248 INFO
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-19 22:26:25,253 INFO
 org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Connection to ZooKeeper was reconnected. Leader retrieval can be
restarted.
2019-09-19 22:26:25,253 INFO
 org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Connection to ZooKeeper was reconnected. Leader retrieval can be
restarted.
2019-09-19 22:26:25,253 INFO
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-19 22:26:25,253 INFO
 org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-09-19 22:26:25,253 INFO
 org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are
monitored again.
2019-09-19 22:26:34,376 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Disassociated]
2019-09-19 22:26:34,376 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink-metrics@XXXXXX.11:38091] has failed, address is now gated
for [50] ms. Reason: [Disassociated]
2019-09-19 22:26:35,147 WARN  akka.remote.transport.netty.NettyTransport
                 - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:26:35,149 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:26:45,167 WARN  akka.remote.transport.netty.NettyTransport
                 - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:26:45,168 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:26:55,151 WARN  akka.remote.transport.netty.NettyTransport
                 - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:26:55,153 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:27:05,159 WARN  akka.remote.transport.netty.NettyTransport
                 - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:27:05,160 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:27:15,157 WARN  akka.remote.transport.netty.NettyTransport
                 - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:27:15,161 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:27:25,152 WARN  akka.remote.transport.netty.NettyTransport
                 - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:27:25,160 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
Caused by: [Connection refused: /XXXXXX.11:46167]
2019-09-19 22:27:35,161 WARN  akka.remote.transport.netty.NettyTransport
                 - Remote connection to [null] failed with
java.net.ConnectException: Connection refused: /XXXXXX.11:46167
2019-09-19 22:27:35,165 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
Caused by: [Connection refused: /XXXXXX.11:46167]
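
To check whether the warnings really keep pointing at one stale endpoint, the failures can be tallied per remote address. The following is only a rough sketch, not a Flink or Akka tool: the function name and the sample text are made up for illustration, and it assumes the warning wording matches the log excerpt above.

```python
import re
from collections import Counter

# Matches the remote address in Akka's "Association ... has failed" warnings.
ASSOCIATION_FAILED = re.compile(
    r"Association with remote system \[(akka\.tcp://[^\]]+)\] has failed"
)


def count_failed_associations(log_text: str) -> Counter:
    """Count association failures per remote Akka address."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flattened = " ".join(log_text.split())
    return Counter(ASSOCIATION_FAILED.findall(flattened))


# Two wrapped warnings copied from the excerpt above.
sample = """
2019-09-19 22:26:35,149 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
2019-09-19 22:26:45,168 WARN  akka.remote.ReliableDeliverySupervisor
                 - Association with remote system
[akka.tcp://flink@XXXXXX.11:46167] has failed, address is now gated for
[50] ms. Reason: [Association failed with [akka.tcp://flink@XXXXXX.11:46167]]
"""

counts = count_failed_associations(sample)
for address, n in counts.most_common():
    print(f"{n:4d}  {address}")
```

A single address with a steadily growing count would suggest that one dead endpoint is being retried, rather than a wider network problem.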




Re: Warnings connecting to Akka

Posted by Timothy Victor <vi...@gmail.com>.
We see a very similar (if not the same) error running version 1.9 on
Kubernetes. So far, what we have discovered is that a taskmanager gets
killed and a new one is created, but the JM still thinks it needs to connect
to the old (now dead) TM. I was even able to see a taskmanager on the
same host and port but with different TM instance IDs in the Flink UI. The
issue seems to be persistent (i.e. it doesn't clear after a few minutes).

FWIW, the TM was dying due to the liveness probe in K8s. We have increased
that, but the above issue is still a concern.

Any ideas?

Tim


Re: Warnings connecting to Akka

Posted by John Smith <ja...@gmail.com>.
Sorry, I've been away on leave. I'll check ASAP.


Re: Warnings connecting to Akka

Posted by Zili Chen <wa...@gmail.com>.
Does the log you attached above come from a TaskManager node? If so,
what state is the Job node it tried to connect to? Did it crash?

BTW, it would be helpful if you could attach more of the TM and JM logs,
beyond the two lines that report the Akka connection being refused.
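
One quick way to answer the "did it crash?" question is to test whether anything is still listening on the address from the warning. This is a minimal sketch using only the Python standard library; the function name is made up, and the host and port in the usage line are placeholders to be replaced with the values from your own log.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: substitute the host and port from the warning.
    print(is_port_open("127.0.0.1", 46167))
```

A False result is consistent with the "Connection refused" in the log, meaning no process is accepting connections on that port any more. It does not by itself say why the process went away.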



Re: Warnings connecting to Akka

Posted by John Smith <ja...@gmail.com>.
So I guess it had some older state?


Re: Warnings connecting to Akka

Posted by John Smith <ja...@gmail.com>.
I'm running a standalone cluster with ZooKeeper. It seems it was trying to
connect to an older node. I rebooted the Job node that was complaining. It
seems to be OK now...

I have 3 ZooKeepers, 3 Job nodes and 3 Task nodes.


Re: Warnings connecting to Akka

Posted by Zili Chen <wa...@gmail.com>.
Hi John,

Could you provide some details, such as which mode you run in
(standalone/YARN) and the related configuration (jobmanager.address,
jobmanager.port and so on)?

Best,
tison.
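
For anyone gathering the settings requested above, a small script can pull the relevant entries out of flink-conf.yaml so they can be pasted into a reply. This is just a sketch under a few assumptions: the keys listed are the Flink option names jobmanager.rpc.address and jobmanager.rpc.port plus the ZooKeeper HA options, the sample values are invented, and the parser handles only flat "key: value" lines.

```python
# Keys that describe where the JobManager listens and how HA is set up.
KEYS_OF_INTEREST = (
    "jobmanager.rpc.address",
    "jobmanager.rpc.port",
    "high-availability",
    "high-availability.zookeeper.quorum",
)


def extract_settings(conf_text: str) -> dict:
    """Return the KEYS_OF_INTEREST entries found in flink-conf.yaml text."""
    found = {}
    for line in conf_text.splitlines():
        # Drop comments and surrounding whitespace (naive: assumes no '#'
        # inside values).
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so "host:port" values stay intact.
        key, value = line.split(":", 1)
        if key.strip() in KEYS_OF_INTEREST:
            found[key.strip()] = value.strip()
    return found


# Invented values, for illustration only.
sample_conf = """
# Hypothetical standalone HA setup
jobmanager.rpc.address: job-node-1
jobmanager.rpc.port: 6123
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
taskmanager.numberOfTaskSlots: 4
"""

for key, value in extract_settings(sample_conf).items():
    print(f"{key}: {value}")
```

Remember to redact hostnames before posting the output to a public list.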

