Posted to user@flink.apache.org by "Kumar Bolar, Harshith" <hk...@arity.com> on 2019/03/14 14:42:03 UTC

Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi all,

I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2.

When I bring up the cluster, the task managers refuse to connect to the job managers with the following error.

        2019-03-14 10:34:41,551 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or service not known]

Now, this works correctly if I add the following line into the /etc/hosts file.

        x.x.x.x job-manager-address.com cluster

Why is Flink 1.7.2 connecting to the JM using "cluster" as the hostname in the address? Flink 1.4.2 used the job manager's address instead of the word "cluster".

Thanks,
Harshith
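
For anyone debugging the same warning, a quick first check is whether the hostname in the Akka address resolves on the task manager host. The sketch below is only an illustration and not part of the original post: the function names akka_host and host_resolves are hypothetical, and getent is assumed to be available (as it is on most Linux systems).

```shell
# Hypothetical helpers for illustration; not part of Flink.

# Print the host part of an Akka address such as
# akka.tcp://flink@cluster:22671 or akka.tcp://flink@host:1234/user/x
akka_host() {
    local rest=${1#*@}            # drop everything up to and including '@'
    rest=${rest%%/*}              # drop any actor path after host:port
    printf '%s\n' "${rest%%:*}"   # drop ':port'
}

# Return 0 if the name resolves via the system resolver (/etc/hosts, DNS).
host_resolves() {
    getent hosts "$1" > /dev/null 2>&1
}

host=$(akka_host "akka.tcp://flink@cluster:22671")
if host_resolves "$host"; then
    echo "$host resolves"
else
    echo "$host does not resolve; check which address the job manager advertises"
fi
```

If the extracted name is not a real host (here, the literal word "cluster"), the more useful question is why the job manager advertises it, rather than how to make it resolve.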


Re: Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Posted by "Kumar Bolar, Harshith" <hk...@arity.com>.
Hi Gary,

The job manager was indeed being invoked with a second parameter.

${Flink_HOME}/bin/jobmanager.sh start cluster

I removed the second argument and everything works fine now. I really appreciate your help. Thanks a lot :-)

Regards,
Harshith
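
Deployment scripts written for Flink 1.4 may still pass the old execution mode as the second argument. A small guard along these lines could catch that before the job manager starts. This is a hypothetical helper (the name check_jm_host_arg is made up for illustration), not something shipped with Flink.

```shell
# Hypothetical pre-flight check for wrapper scripts; not part of Flink.
# Returns 1 if the second argument looks like a pre-1.5 execution mode.
check_jm_host_arg() {
    case "$2" in
        local|cluster)
            echo "warning: '$2' is treated as a hostname by jobmanager.sh since Flink 1.5" >&2
            return 1
            ;;
        *)
            return 0
            ;;
    esac
}

# Example: guard the invocation from the message above.
if check_jm_host_arg start cluster; then
    echo "arguments look fine"
else
    echo "drop the second argument: jobmanager.sh start"
fi
```

The check is deliberately narrow: it only flags the two legacy mode names and lets any other value through, since a real hostname is still a valid second argument.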

From: Gary Yao <ga...@ververica.com>
Date: Friday, 15 March 2019 at 12:41 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: user <us...@flink.apache.org>
Subject: [External] Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

I forgot to add line numbers to the first link in my previous email:

    https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh#L21-L25

On Fri, Mar 15, 2019 at 8:08 AM Gary Yao <ga...@ververica.com> wrote:
Hi Harshith,

In the jobmanager.sh script, the 2nd argument is assigned to the HOST variable
[1]. How are you invoking jobmanager.sh? Prior to 1.5, the script expected an
execution mode (local or cluster) but this is no longer the case [2].

Best,
Gary

[1] https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh
[2] https://github.com/apache/flink/commit/d61664ca64bcb82c4e8ddf03a2ed38fe8edafa98
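
To make the mechanism concrete, here is a simplified sketch of how a second positional argument ends up as a --host option. It is an illustration only, not the actual Flink script: the function name entrypoint_args is hypothetical, and the real jobmanager.sh appends to a bash array rather than building a string.

```shell
# Simplified illustration of the argument handling described above.
# entrypoint_args is a hypothetical name; this is not the Flink script itself.
entrypoint_args() {
    # $1 is the action (start, stop, ...); $2 is an optional host.
    local host=$2
    local args=""
    if [ -n "$host" ]; then
        args="--host $host"
    fi
    printf '%s\n' "$args"
}

entrypoint_args start cluster   # prints: --host cluster
entrypoint_args start           # prints an empty line
```

With the legacy invocation, the word "cluster" is passed on as the host, which matches the address seen in the logs in this thread.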

On Fri, Mar 15, 2019 at 3:36 AM Kumar Bolar, Harshith <hk...@arity.com> wrote:
Hi Gary,

An update. I noticed the line “--host cluster” in the program arguments section of the job manager logs. So, I commented out the following section in jobmanager.sh, and the task manager is now able to connect to the job manager without issues.

    if [ ! -z $HOST ]; then
        args+=("--host")
        args+=("${HOST}")
    fi


Task manager logs after commenting those lines:


2019-03-14 22:31:02,863 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-03-14 22:31:02,875 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 22:31:02,876 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job leader service.
2019-03-14 22:31:02,877 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-12d5905f-d694-46f6-9359-3a636188b008
2019-03-14 22:31:02,884 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@flink0-1.flink1.us-east-1.high.ue1.non.aws.cloud.arity.com:28945/user/resourcemanager(8583b335fd08a30a89585b7af07e4213).
2019-03-14 22:31:03,109 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Resolved ResourceManager address, beginning registration
2019-03-14 22:31:03,110 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-03-14 22:31:03,228 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration at ResourceManager attempt 2 (timeout=200ms)
2019-03-14 22:31:03,266 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Successful registration at resource manager akka.tcp://flink@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager under registration id 170ee6a00f80ee02ead0e88710093d77.


Thanks,
Harshith

From: Harshith Kumar Bolar <hk...@arity.com>
Date: Friday, 15 March 2019 at 7:38 AM
To: Gary Yao <ga...@ververica.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Gary,

Here are the full job manager and task manager logs. In the job manager logs, I see it says “starting StandaloneSessionClusterEntrypoint”, whereas in Flink 1.4.2, it used to say “starting JobManager”. Is this correct?

Job manager logs: https://paste.ubuntu.com/p/DCVzsQdpHq/
Task manager logs: https://paste.ubuntu.com/p/wbvYFZxdT8/

Thanks,
Harshith

From: Gary Yao <ga...@ververica.com>
Date: Thursday, 14 March 2019 at 10:11 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: user <us...@flink.apache.org>
Subject: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Harshith,

The truncated log is not enough. Can you share the complete logs? If that's
not possible, I'd like to see the beginning of the log files where the cluster
configuration is logged.

The TaskManager tries to connect to the leader that is advertised in
ZooKeeper. In your case the "cluster" hostname is advertised, which hints at a
problem in your Flink configuration.

Best,
Gary
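
One way to see which address is being advertised is to pull it out of the task manager log. The sketch below assumes the log line format quoted in this thread ("Connecting to ResourceManager akka.tcp://..."); the function name advertised_rm_host is hypothetical.

```shell
# Hypothetical helper: print the host the TaskExecutor was told to contact.
# Reads a task manager log on stdin; assumes the line format seen in this thread.
advertised_rm_host() {
    sed -n 's|.*Connecting to ResourceManager akka\.tcp://[^@]*@\([^:/]*\).*|\1|p'
}

# Example with a log line in the format quoted in this thread:
printf '%s\n' \
  '2019-03-14 11:42:35,871 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).' \
  | advertised_rm_host   # prints: cluster
```

Against a real log file the call would look like `advertised_rm_host < taskmanager.log`, where the file name is only a placeholder for wherever the task manager writes its log.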

On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:
Hi Gary,

I’ve attached the relevant portions of the JM and TM logs.

Job Manager Logs:

2019-03-14 11:38:28,257 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2019-03-14 11:38:28,309 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component log file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
2019-03-14 11:38:28,309 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component stdout file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
2019-03-14 11:38:28,527 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at cluster:8080
2019-03-14 11:38:28,527 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2019-03-14 11:38:28,574 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://cluster:8080.
2019-03-14 11:38:28,613 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2019-03-14 11:38:28,674 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2019-03-14 11:38:28,691 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2019-03-14 11:38:28,694 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:38:28,698 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2019-03-14 11:38:28,700 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-03-14 11:38:28,818 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
2019-03-14 11:39:09,010 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://cluster:8080 was granted leadership with leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
2019-03-14 11:39:09,010 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
2019-03-14 11:39:09,011 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2019-03-14 11:39:09,012 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
2019-03-14 11:39:09,017 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.

Task Manager Logs:

2019-03-14 11:42:35,790 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill files.
2019-03-14 11:42:35,820 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages have a max timeout of 10000 ms
2019-03-14 11:42:35,839 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-03-14 11:42:35,853 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:42:35,854 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job leader service.
2019-03-14 11:42:35,855 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
2019-03-14 11:42:35,871 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
2019-03-14 11:42:35,963 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@cluster:31794] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service not known]
2019-03-14 11:42:35,964 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@cluster:31794/user/resourcemanager..
2019-03-14 11:47:35,895 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor            - Fatal error occurred in TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,897 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
2019-03-14 11:47:35,908 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager removed spill file directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
2019-03-14 11:47:35,908 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting down the network environment and its components.
2019-03-14 11:47:35,914 INFO  org.apache.flink.runtime.io.network.netty.NettyClient         - Successful shutdown (took 5 ms).
2019-03-14 11:47:35,917 INFO  org.apache.flink.runtime.io.network.netty.NettyServer         - Successful shutdown (took 2 ms).
2019-03-14 11:47:35,925 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Stop job leader service.
2019-03-14 11:47:35,931 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopped TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,931 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache
2019-03-14 11:47:35,933 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
2019-03-14 11:47:35,943 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-03-14 11:47:35,950 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x26977a24c4e0018 closed
2019-03-14 11:47:35,950 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x26977a24c4e0018
2019-03-14 11:47:35,950 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping Akka RPC service.
2019-03-14 11:47:35,952 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2019-03-14 11:47:35,952 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,959 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2019-03-14 11:47:35,966 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,983 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2019-03-14 11:47:35,984 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2019-03-14 11:47:35,992 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped Akka RPC service.


From: Gary Yao <ga...@ververica.com>
Date: Thursday, 14 March 2019 at 9:06 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: user <us...@flink.apache.org>
Subject: [External] Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Harshith,

Can you share JM and TM logs?

Best,
Gary



Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Posted by Gary Yao <ga...@ververica.com>.
I forgot to add line numbers to the first link in my previous email:


https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh#L21-L25

On Fri, Mar 15, 2019 at 8:08 AM Gary Yao <ga...@ververica.com> wrote:

> Hi Harshith,
>
> In the jobmanager.sh script, the 2nd argument is assigned to the HOST
> variable
> [1]. How are you invoking jobmanager.sh? Prior to 1.5, the script expected
> an
> execution mode (local or cluster) but this is no longer the case [2].
>
> Best,
> Gary
>
> [1]
> https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh
> [2]
> https://github.com/apache/flink/commit/d61664ca64bcb82c4e8ddf03a2ed38fe8edafa98
>
> On Fri, Mar 15, 2019 at 3:36 AM Kumar Bolar, Harshith <hk...@arity.com>
> wrote:
>
>> Hi Gary,
>>
>>
>>
>> An update. I noticed the line “–host cluster” in the program arguments
>> section of the job manager logs. So, I commented the following section in
>> jobmanager.sh, the task manager is now able to connect to job manager
>> without issues.
>>
>>
>>
>>   *if [ ! -z $HOST ]; then*
>>
>> *        args+=("--host")*
>>
>> *        args+=("${HOST}")*
>>
>> *fi*
>>
>>
>>
>>
>>
>> Task manager logs after commenting those lines:
>>
>>
>>
>>
>> * 2019-03-14 22:31:02,863 INFO
>> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
>> RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at
>> akka://flink/user/taskmanager_0 .*
>>
>> *2019-03-14 22:31:02,875 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.*
>>
>> *2019-03-14 22:31:02,876 INFO
>> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job
>> leader service.*
>>
>> *2019-03-14 22:31:02,877 INFO
>> org.apache.flink.runtime.filecache.FileCache                  - User file
>> cache uses directory
>> /tmp/flink-dist-cache-12d5905f-d694-46f6-9359-3a636188b008*
>>
>> *2019-03-14 22:31:02,884 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting
>> to ResourceManager
>> akka.tcp://flink@flink0-1.flink1.us-east-1.high.ue1.non.aws.cloud.arity.com:28945/user/resourcemanager(8583b335fd08a30a89585b7af07e4213)
>> <http://flink@flink0-1.flink1.us-east-1.high.ue1.non.aws.cloud.arity.com:28945/user/resourcemanager(8583b335fd08a30a89585b7af07e4213)>.*
>>
>> *2019-03-14 22:31:03,109 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Resolved
>> ResourceManager address, beginning registration*
>>
>> *2019-03-14 22:31:03,110 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            -
>> Registration at ResourceManager attempt 1 (timeout=100ms)*
>>
>> *2019-03-14 22:31:03,228 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            -
>> Registration at ResourceManager attempt 2 (timeout=200ms)*
>>
>> *2019-03-14 22:31:03,266 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Successful
>> registration at resource manager
>> akka.tcp://flink@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager
>> <http://flink@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager>
>> under registration id 170ee6a00f80ee02ead0e88710093d77.*
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Harshith
>>
>>
>>
>> *From: *Harshith Kumar Bolar <hk...@arity.com>
>> *Date: *Friday, 15 March 2019 at 7:38 AM
>> *To: *Gary Yao <ga...@ververica.com>
>> *Cc: *user <us...@flink.apache.org>
>> *Subject: *Re: [External] Re: Re: Flink 1.7.2: Task Manager not able to
>> connect to Job Manager
>>
>>
>>
>> Hi Gary,
>>
>>
>>
>> Here are the full job manager and task manager logs. In the job manager
>> logs, I see it says “*starting StandaloneSessionClusterEntrypoint”,* whereas
>> in Flink 1.4.2, it used to say “*starting JobManager”*. Is this correct?
>>
>>
>>
>> Job manager logs: https://paste.ubuntu.com/p/DCVzsQdpHq/
>>
>> Task Manager logs: https://paste.ubuntu.com/p/wbvYFZxdT8/
>>
>>
>>
>> Thanks,
>>
>> Harshith
>>
>>
>>
>> *From: *Gary Yao <ga...@ververica.com>
>> *Date: *Thursday, 14 March 2019 at 10:11 PM
>> *To: *Harshith Kumar Bolar <hk...@arity.com>
>> *Cc: *user <us...@flink.apache.org>
>> *Subject: *[External] Re: Re: Flink 1.7.2: Task Manager not able to
>> connect to Job Manager
>>
>>
>>
>> Hi Harshith,
>>
>> The truncated log is not enough. Can you share the complete logs? If
>> that's not possible, I'd like to see the beginning of the log files,
>> where the cluster configuration is logged.
>>
>> The TaskManager tries to connect to the leader that is advertised in
>> ZooKeeper. In your case the "cluster" hostname is advertised, which hints
>> at a problem in your Flink configuration.
>>
>> Best,
>> Gary
>>
>>
>>
>> On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hk...@arity.com>
>> wrote:
>>
>> Hi Gary,
>>
>>
>>
>> I’ve attached the relevant portions of the JM and TM logs.
>>
>>
>>
>> *Job Manager Logs:*
>>
>> 2019-03-14 11:38:28,257 INFO
>> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
>> - State change: CONNECTED
>> 2019-03-14 11:38:28,309 INFO
>> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
>> location of main cluster component log file:
>> /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
>> 2019-03-14 11:38:28,309 INFO
>> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
>> location of main cluster component stdout file:
>> /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
>> 2019-03-14 11:38:28,527 INFO
>> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest
>> endpoint listening at cluster:8080
>> 2019-03-14 11:38:28,527 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Starting ZooKeeperLeaderElectionService
>> ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
>> 2019-03-14 11:38:28,574 INFO
>> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web
>> frontend listening at http://cluster:8080.
>> 2019-03-14 11:38:28,613 INFO
>> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
>> RPC endpoint for
>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
>> akka://flink/user/resourcemanager .
>> 2019-03-14 11:38:28,674 INFO
>> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
>> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
>> at akka://flink/user/dispatcher .
>> 2019-03-14 11:38:28,691 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Starting ZooKeeperLeaderElectionService
>> ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
>> 2019-03-14 11:38:28,694 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>> 2019-03-14 11:38:28,698 INFO
>> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
>> Starting ZooKeeperLeaderElectionService
>> ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
>> 2019-03-14 11:38:28,700 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
>> 2019-03-14 11:38:28,818 WARN
>> akka.remote.ReliableDeliverySupervisor                        - Association
>> with remote system [akka.tcp://flink@cluster:22671] has failed, address
>> is now gated for [50] ms. Reason: [Association failed with
>> [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
>> 2019-03-14 11:39:09,010 INFO
>> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    -
>> http://cluster:8080 was granted leadership with
>> leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
>> 2019-03-14 11:39:09,010 INFO
>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>> ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was
>> granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
>> 2019-03-14 11:39:09,011 INFO
>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
>> Starting the SlotManager.
>> 2019-03-14 11:39:09,012 INFO
>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher
>> akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership
>> with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
>> 2019-03-14 11:39:09,017 INFO
>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering
>> all persisted jobs.
>>
>> *Task Manager Logs:*
>>
>> 2019-03-14 11:42:35,790 INFO
>> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
>> uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill
>> files.
>> 2019-03-14 11:42:35,820 INFO
>> org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages
>> have a max timeout of 10000 ms
>> 2019-03-14 11:42:35,839 INFO
>> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
>> RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at
>> akka://flink/user/taskmanager_0 .
>> 2019-03-14 11:42:35,853 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>> 2019-03-14 11:42:35,854 INFO
>> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job
>> leader service.
>> 2019-03-14 11:42:35,855 INFO
>> org.apache.flink.runtime.filecache.FileCache                  - User file
>> cache uses directory
>> /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
>> 2019-03-14 11:42:35,871 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting
>> to ResourceManager
>> akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
>> 2019-03-14 11:42:35,963 WARN
>> akka.remote.ReliableDeliverySupervisor                        - Association
>> with remote system [akka.tcp://flink@cluster:31794] has failed, address
>> is now gated for [50] ms. Reason: [Association failed with
>> [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service
>> not known]
>> 2019-03-14 11:42:35,964 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not
>> resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager,
>> retrying in 10000 ms: Could not connect to rpc endpoint under address
>> akka.tcp://flink@cluster:31794/user/resourcemanager..
>> 2019-03-14 11:47:35,895 ERROR
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Fatal error
>> occurred in TaskExecutor
>> akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>> Could not register at the ResourceManager within the specified maximum
>> registration duration 300000 ms. This indicates a problem with this
>> instance. Terminating now.
>>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>>    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>>    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 2019-03-14 11:47:35,897 ERROR
>> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error
>> occurred while executing the TaskManager. Shutting it down...
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>> Could not register at the ResourceManager within the specified maximum
>> registration duration 300000 ms. This indicates a problem with this
>> instance. Terminating now.
>>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>>    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>>    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 2019-03-14 11:47:35,904 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping
>> TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
>> 2019-03-14 11:47:35,904 INFO
>> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>> 2019-03-14 11:47:35,904 INFO
>> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
>> Shutting down TaskExecutorLocalStateStoresManager.
>> 2019-03-14 11:47:35,908 INFO
>> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
>> removed spill file directory
>> /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
>> 2019-03-14 11:47:35,908 INFO
>> org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting
>> down the network environment and its components.
>> 2019-03-14 11:47:35,914 INFO
>> org.apache.flink.runtime.io.network.netty.NettyClient         - Successful
>> shutdown (took 5 ms).
>> 2019-03-14 11:47:35,917 INFO
>> org.apache.flink.runtime.io.network.netty.NettyServer         - Successful
>> shutdown (took 2 ms).
>> 2019-03-14 11:47:35,925 INFO
>> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Stop job
>> leader service.
>> 2019-03-14 11:47:35,931 INFO
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopped
>> TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
>> 2019-03-14 11:47:35,931 INFO
>> org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting
>> down BLOB cache
>> 2019-03-14 11:47:35,933 INFO
>> org.apache.flink.runtime.blob.TransientBlobCache              - Shutting
>> down BLOB cache
>> 2019-03-14 11:47:35,943 INFO
>> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
>> - backgroundOperationsLoop exiting
>> 2019-03-14 11:47:35,950 INFO
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
>> Session: 0x26977a24c4e0018 closed
>> 2019-03-14 11:47:35,950 INFO
>> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
>> EventThread shut down for session: 0x26977a24c4e0018
>> 2019-03-14 11:47:35,950 INFO
>> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping
>> Akka RPC service.
>> 2019-03-14 11:47:35,952 INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting
>> down remote daemon.
>> 2019-03-14 11:47:35,952 INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote
>> daemon shut down; proceeding with flushing remote transports.
>> 2019-03-14 11:47:35,959 INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting
>> down remote daemon.
>> 2019-03-14 11:47:35,966 INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote
>> daemon shut down; proceeding with flushing remote transports.
>> 2019-03-14 11:47:35,983 INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting
>> shut down.
>> 2019-03-14 11:47:35,984 INFO
>> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting
>> shut down.
>> 2019-03-14 11:47:35,992 INFO
>> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped
>> Akka RPC service.
>>
>>
>>
>>
>>
>> *From: *Gary Yao <ga...@ververica.com>
>> *Date: *Thursday, 14 March 2019 at 9:06 PM
>> *To: *Harshith Kumar Bolar <hk...@arity.com>
>> *Cc: *user <us...@flink.apache.org>
>> *Subject: *[External] Re: Flink 1.7.2: Task Manager not able to connect
>> to Job Manager
>>
>>
>>
>> Hi Harshith,
>>
>>
>>
>> Can you share JM and TM logs?
>>
>>
>>
>> Best,
>>
>> Gary
>>
>>
>>
>> On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com>
>> wrote:
>>
>> Hi all,
>>
>>
>>
>> I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2
>>
>>
>>
>> When I bring up the cluster, the task managers refuse to connect to the
>> job managers with the following error.
>>
>>
>>
>>         2019-03-14 10:34:41,551 WARN
>> akka.remote.ReliableDeliverySupervisor
>>
>>         - Association with remote system [akka.tcp://flink@cluster:22671]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or
>> service not known]
>>
>>
>>
>> Now, this works correctly if I add the following line into
>> the /etc/hosts file.
>>
>>
>>
>>         x.x.x.x job-manager-address.com cluster
>>
>>
>>
>> Why is Flink 1.7.2 connecting to JM using cluster in the address? Flink
>> 1.4.2 used to have the job manager's address instead of the word cluster.
>>
>>
>>
>> Thanks,
>>
>> Harshith
>>
>>
>>
>>

Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Posted by Gary Yao <ga...@ververica.com>.
Hi Harshith,

In the jobmanager.sh script, the 2nd argument is assigned to the HOST
variable [1]. How are you invoking jobmanager.sh? Prior to 1.5, the script
expected an execution mode (local or cluster), but this is no longer the
case [2].

Best,
Gary

[1]
https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh
[2]
https://github.com/apache/flink/commit/d61664ca64bcb82c4e8ddf03a2ed38fe8edafa98
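
To illustrate the argument handling described above, here is a simplified
sketch. It is not the real jobmanager.sh: the function name
build_jobmanager_args is hypothetical, and only the behaviour mentioned in
this thread is modeled (the 2nd positional argument becomes the host and is
forwarded as "--host <value>").

```shell
#!/usr/bin/env bash
# Simplified sketch, NOT the real jobmanager.sh. It models only the
# positional-argument handling discussed in this thread.

# Hypothetical helper: prints the extra program arguments that would be
# passed along for a call of the form "jobmanager.sh <action> [host]".
build_jobmanager_args() {
    local action="$1"   # first argument, e.g. "start" (unused in this sketch)
    local host="$2"     # second argument is treated as the host name
    local args=()

    # A non-empty second argument is forwarded as "--host <value>".
    if [ -n "$host" ]; then
        args+=("--host")
        args+=("$host")
    fi

    echo "${args[*]}"
}

# Pre-1.5 style invocation: "cluster" is now read as the host name.
build_jobmanager_args start cluster    # prints: --host cluster

# Invocation without a second argument: no --host is added.
build_jobmanager_args start            # prints an empty line
```

Under this assumption, "jobmanager.sh start cluster" would make the
JobManager advertise the hostname "cluster", which would match the
"akka.tcp://flink@cluster:..." addresses in the logs.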

On Fri, Mar 15, 2019 at 3:36 AM Kumar Bolar, Harshith <hk...@arity.com>
wrote:

> Hi Gary,
>
>
>
> An update: I noticed “--host cluster” in the program arguments section
> of the job manager logs. So I commented out the following section in
> jobmanager.sh, and the task manager is now able to connect to the job
> manager without issues.
>
>
>
>   if [ ! -z $HOST ]; then
>       args+=("--host")
>       args+=("${HOST}")
>   fi
>
>
>
>
>
> Task manager logs after commenting those lines:
>
>
>
>
> * 2019-03-14 22:31:02,863 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at
> akka://flink/user/taskmanager_0 .*
>
> *2019-03-14 22:31:02,875 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.*
>
> *2019-03-14 22:31:02,876 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job
> leader service.*
>
> *2019-03-14 22:31:02,877 INFO
> org.apache.flink.runtime.filecache.FileCache                  - User file
> cache uses directory
> /tmp/flink-dist-cache-12d5905f-d694-46f6-9359-3a636188b008*
>
> *2019-03-14 22:31:02,884 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting
> to ResourceManager
> akka.tcp://flink@flink0-1.flink1.us-east-1.high.ue1.non.aws.cloud.arity.com:28945/user/resourcemanager(8583b335fd08a30a89585b7af07e4213).*
>
> *2019-03-14 22:31:03,109 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Resolved
> ResourceManager address, beginning registration*
>
> *2019-03-14 22:31:03,110 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            -
> Registration at ResourceManager attempt 1 (timeout=100ms)*
>
> *2019-03-14 22:31:03,228 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            -
> Registration at ResourceManager attempt 2 (timeout=200ms)*
>
> *2019-03-14 22:31:03,266 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Successful
> registration at resource manager
> akka.tcp://flink@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager
> under registration id 170ee6a00f80ee02ead0e88710093d77.*
>
>
>
>
>
> Thanks,
>
> Harshith
>
>
>
> *From: *Harshith Kumar Bolar <hk...@arity.com>
> *Date: *Friday, 15 March 2019 at 7:38 AM
> *To: *Gary Yao <ga...@ververica.com>
> *Cc: *user <us...@flink.apache.org>
> *Subject: *Re: [External] Re: Re: Flink 1.7.2: Task Manager not able to
> connect to Job Manager
>
>
>
> Hi Gary,
>
>
>
> Here are the full job manager and task manager logs. In the job manager
> logs, I see it says “starting StandaloneSessionClusterEntrypoint”, whereas
> in Flink 1.4.2, it used to say “starting JobManager”. Is this correct?
>
>
>
> Job manager logs: https://paste.ubuntu.com/p/DCVzsQdpHq/
>
> Task Manager logs: https://paste.ubuntu.com/p/wbvYFZxdT8/
>
>
>
> Thanks,
>
> Harshith
>
>
>
> *From: *Gary Yao <ga...@ververica.com>
> *Date: *Thursday, 14 March 2019 at 10:11 PM
> *To: *Harshith Kumar Bolar <hk...@arity.com>
> *Cc: *user <us...@flink.apache.org>
> *Subject: *[External] Re: Re: Flink 1.7.2: Task Manager not able to
> connect to Job Manager
>
>
>
> Hi Harshith,
>
> The truncated log is not enough. Can you share the complete logs? If that's
> not possible, I'd like to see the beginning of the log files, where the
> cluster configuration is logged.
>
> The TaskManager tries to connect to the leader that is advertised in
> ZooKeeper. In your case the "cluster" hostname is advertised, which hints
> at a problem in your Flink configuration.
>
> Best,
> Gary
>
>
>
> On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hk...@arity.com>
> wrote:
>
> Hi Gary,
>
>
>
> I’ve attached the relevant portions of the JM and TM logs.
>
>
>
> *Job Manager Logs:*
>
> 2019-03-14 11:38:28,257 INFO
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
> - State change: CONNECTED
> 2019-03-14 11:38:28,309 INFO
> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
> location of main cluster component log file:
> /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
> 2019-03-14 11:38:28,309 INFO
> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
> location of main cluster component stdout file:
> /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
> 2019-03-14 11:38:28,527 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest
> endpoint listening at cluster:8080
> 2019-03-14 11:38:28,527 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
> 2019-03-14 11:38:28,574 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web
> frontend listening at http://cluster:8080.
> 2019-03-14 11:38:28,613 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
> akka://flink/user/resourcemanager .
> 2019-03-14 11:38:28,674 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> at akka://flink/user/dispatcher .
> 2019-03-14 11:38:28,691 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
> 2019-03-14 11:38:28,694 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:38:28,698 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
> 2019-03-14 11:38:28,700 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 2019-03-14 11:38:28,818 WARN
> akka.remote.ReliableDeliverySupervisor                        - Association
> with remote system [akka.tcp://flink@cluster:22671] has failed, address
> is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
> 2019-03-14 11:39:09,010 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    -
> http://cluster:8080 was granted leadership with
> leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
> 2019-03-14 11:39:09,010 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
> ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was
> granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
> 2019-03-14 11:39:09,011 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
> Starting the SlotManager.
> 2019-03-14 11:39:09,012 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher
> akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership
> with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
> 2019-03-14 11:39:09,017 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering
> all persisted jobs.
>
> *Task Manager Logs:*
>
> 2019-03-14 11:42:35,790 INFO
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
> uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill
> files.
> 2019-03-14 11:42:35,820 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages
> have a max timeout of 10000 ms
> 2019-03-14 11:42:35,839 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at
> akka://flink/user/taskmanager_0 .
> 2019-03-14 11:42:35,853 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:42:35,854 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job
> leader service.
> 2019-03-14 11:42:35,855 INFO
> org.apache.flink.runtime.filecache.FileCache                  - User file
> cache uses directory
> /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
> 2019-03-14 11:42:35,871 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting
> to ResourceManager
> akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
> 2019-03-14 11:42:35,963 WARN
> akka.remote.ReliableDeliverySupervisor                        - Association
> with remote system [akka.tcp://flink@cluster:31794] has failed, address
> is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service
> not known]
> 2019-03-14 11:42:35,964 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not
> resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager,
> retrying in 10000 ms: Could not connect to rpc endpoint under address
> akka.tcp://flink@cluster:31794/user/resourcemanager..
> 2019-03-14 11:47:35,895 ERROR
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Fatal error
> occurred in TaskExecutor
> akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration 300000 ms. This indicates a problem with this
> instance. Terminating now.
>    at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(
> TaskExecutor.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__TaskExecutor.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=wdm3q_iJnu8L9xmD8hreg638d7pxSet6twA4ggwlDIY&e=>
> :1037)
>    at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(
> TaskExecutor.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__TaskExecutor.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=wdm3q_iJnu8L9xmD8hreg638d7pxSet6twA4ggwlDIY&e=>
> :1023)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(
> AkkaRpcActor.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__AkkaRpcActor.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=uQw7PD53jnoGsG_qcfATfHUWMAPCjhjKqyYBjvYy7iY&e=>
> :332)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(
> AkkaRpcActor.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__AkkaRpcActor.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=uQw7PD53jnoGsG_qcfATfHUWMAPCjhjKqyYBjvYy7iY&e=>
> :158)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(
> AkkaRpcActor.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__AkkaRpcActor.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=uQw7PD53jnoGsG_qcfATfHUWMAPCjhjKqyYBjvYy7iY&e=>
> :142)
>    at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__ForkJoinTask.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=bv8jB1enKafGeoNgdOTLg2sbTtbMfgFehYs0GRLszts&e=>
> :260)
>    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(
> ForkJoinPool.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__ForkJoinPool.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=8tFyqgZpCdRLwcHpdKe3mYfJ2F8ZgSQzMvW59LoO9S4&e=>
> :1339)
>    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__ForkJoinPool.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=8tFyqgZpCdRLwcHpdKe3mYfJ2F8ZgSQzMvW59LoO9S4&e=>
> :1979)
>    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(
> ForkJoinWorkerThread.java
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__ForkJoinWorkerThread.java&d=DwQFaQ&c=gtIjdLs6LnStUpy9cTOW9w&r=61bFb6zUNKZxlAQDRo_jKA&m=8UFr4YnRs-5evbGW--p28mCAv00uGlqHKnYoYchCXb8&s=d_bm2VR2tTF2xi468xPlqDIiV2Bnq07S6kPGj6gOLN4&e=>
> :107)
> 2019-03-14 11:47:35,897 ERROR
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error
> occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration 300000 ms. This indicates a problem with this
> instance. Terminating now.
>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping
> TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
> Shutting down TaskExecutorLocalStateStoresManager.
> 2019-03-14 11:47:35,908 INFO
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
> removed spill file directory
> /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
> 2019-03-14 11:47:35,908 INFO
> org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting
> down the network environment and its components.
> 2019-03-14 11:47:35,914 INFO
> org.apache.flink.runtime.io.network.netty.NettyClient         - Successful
> shutdown (took 5 ms).
> 2019-03-14 11:47:35,917 INFO
> org.apache.flink.runtime.io.network.netty.NettyServer         - Successful
> shutdown (took 2 ms).
> 2019-03-14 11:47:35,925 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Stop job
> leader service.
> 2019-03-14 11:47:35,931 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopped
> TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,931 INFO
> org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting
> down BLOB cache
> 2019-03-14 11:47:35,933 INFO
> org.apache.flink.runtime.blob.TransientBlobCache              - Shutting
> down BLOB cache
> 2019-03-14 11:47:35,943 INFO
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
> - backgroundOperationsLoop exiting
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
> Session: 0x26977a24c4e0018 closed
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
> EventThread shut down for session: 0x26977a24c4e0018
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping
> Akka RPC service.
> 2019-03-14 11:47:35,952 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting
> down remote daemon.
> 2019-03-14 11:47:35,952 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote
> daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,959 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting
> down remote daemon.
> 2019-03-14 11:47:35,966 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote
> daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,983 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting
> shut down.
> 2019-03-14 11:47:35,984 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting
> shut down.
> 2019-03-14 11:47:35,992 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped
> Akka RPC service.
>
>
>
>
>
> *From: *Gary Yao <ga...@ververica.com>
> *Date: *Thursday, 14 March 2019 at 9:06 PM
> *To: *Harshith Kumar Bolar <hk...@arity.com>
> *Cc: *user <us...@flink.apache.org>
> *Subject: *[External] Re: Flink 1.7.2: Task Manager not able to connect
> to Job Manager
>
>
>
> Hi Harshith,
>
>
>
> Can you share JM and TM logs?
>
>
>
> Best,
>
> Gary
>
>
>
> On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com>
> wrote:
>
> Hi all,
>
>
>
> I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2
>
>
>
> When I bring up the cluster, the task managers refuse to connect to the
> job managers with the following error.
>
>
>
>         2019-03-14 10:34:41,551 WARN
> akka.remote.ReliableDeliverySupervisor
>
>         - Association with remote system [akka.tcp://flink@cluster:22671]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or
> service not known]
>
>
>
> Now, this works correctly if I add the following line into
> the /etc/hosts file.
>
>
>
>         x.x.x.x job-manager-address.com cluster
>
>
>
> Why is Flink 1.7.2 connecting to JM using cluster in the address? Flink
> 1.4.2 used to have the job manager's address instead of the word cluster.
>
>
>
> Thanks,
>
> Harshith
>
>
>
>

Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Posted by "Kumar Bolar, Harshith" <hk...@arity.com>.
Hi Gary,

An update: I noticed the line “--host cluster” in the program arguments section of the job manager logs. So I commented out the following section in jobmanager.sh, and the task manager is now able to connect to the job manager without issues.

    if [ ! -z $HOST ]; then
        args+=("--host")
        args+=("${HOST}")
    fi
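
For anyone reading along, here is a minimal sketch of why a stray second argument would show up as "--host" in the program arguments. It assumes, as the snippet above suggests, that the script treats its second positional parameter as the host. The function name build_args and the config path are hypothetical placeholders, and this is not the actual jobmanager.sh.

```shell
# Hypothetical sketch, NOT the real jobmanager.sh.
# Assumption: the second positional argument is treated as the host.
build_args() {
    host="$2"                               # empty when only "start" is passed
    set -- "--configDir" "/opt/flink/conf"  # placeholder config path (assumption)
    if [ -n "$host" ]; then
        # Same effect as the args+=("--host") / args+=("${HOST}") lines above
        set -- "$@" "--host" "$host"
    fi
    printf '%s\n' "$*"
}

build_args start cluster   # prints: --configDir /opt/flink/conf --host cluster
build_args start           # prints: --configDir /opt/flink/conf
```

Under that assumption, dropping the extra argument from the start command should have the same effect as commenting out the block, without modifying the shipped script.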


Task manager logs after commenting those lines:


2019-03-14 22:31:02,863 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-03-14 22:31:02,875 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 22:31:02,876 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job leader service.
2019-03-14 22:31:02,877 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-12d5905f-d694-46f6-9359-3a636188b008
2019-03-14 22:31:02,884 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@flink0-1.flink1.us-east-1.high.ue1.non.aws.cloud.arity.com:28945/user/resourcemanager(8583b335fd08a30a89585b7af07e4213).
2019-03-14 22:31:03,109 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Resolved ResourceManager address, beginning registration
2019-03-14 22:31:03,110 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-03-14 22:31:03,228 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration at ResourceManager attempt 2 (timeout=200ms)
2019-03-14 22:31:03,266 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Successful registration at resource manager akka.tcp://flink@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager under registration id 170ee6a00f80ee02ead0e88710093d77.


Thanks,
Harshith

From: Harshith Kumar Bolar <hk...@arity.com>
Date: Friday, 15 March 2019 at 7:38 AM
To: Gary Yao <ga...@ververica.com>
Cc: user <us...@flink.apache.org>
Subject: Re: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Gary,

Here are the full job manager and task manager logs. In the job manager logs, I see it says “starting StandaloneSessionClusterEntrypoint”, whereas in Flink 1.4.2, it used to say “starting JobManager”. Is this correct?

Job manager logs: https://paste.ubuntu.com/p/DCVzsQdpHq/
Task manager logs: https://paste.ubuntu.com/p/wbvYFZxdT8/

Thanks,
Harshith

From: Gary Yao <ga...@ververica.com>
Date: Thursday, 14 March 2019 at 10:11 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: user <us...@flink.apache.org>
Subject: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Harshith,

The truncated log is not enough. Can you share the complete logs? If that's
not possible, I'd like to see the beginning of the log files where the cluster
configuration is logged.

The TaskManager tries to connect to the leader that is advertised in
ZooKeeper. In your case the "cluster" hostname is advertised, which hints at a
problem in your Flink configuration.
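
A quick way to check this from a task manager machine is sketched below. It pulls the host out of an Akka address as it appears in the logs and offers a helper to test whether that name resolves. The function names akka_host and host_resolves are hypothetical, and the use of getent assumes a typical Linux host.

```shell
# Hypothetical helpers for inspecting the advertised leader address.

# Extract the host part from an address such as
# akka.tcp://flink@cluster:31794/user/resourcemanager
akka_host() {
    hostport="${1#*@}"           # drop everything up to and including "@"
    hostport="${hostport%%/*}"   # drop the actor path
    printf '%s\n' "${hostport%%:*}"   # drop the port
}

# Succeeds only if the name resolves on this machine (assumes getent exists).
host_resolves() {
    command -v getent >/dev/null 2>&1 && getent hosts "$1" >/dev/null
}

akka_host 'akka.tcp://flink@cluster:31794/user/resourcemanager'   # prints: cluster
```

Calling host_resolves with the extracted name on the task manager host then shows whether the advertised address can be reached by name at all. A failure there would match the "Name or service not known" message in the task manager log.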

Best,
Gary

On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:
Hi Gary,

I’ve attached the relevant portions of the JM and TM logs.

Job Manager Logs:

2019-03-14 11:38:28,257 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2019-03-14 11:38:28,309 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component log file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
2019-03-14 11:38:28,309 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component stdout file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
2019-03-14 11:38:28,527 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at cluster:8080
2019-03-14 11:38:28,527 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2019-03-14 11:38:28,574 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://cluster:8080.
2019-03-14 11:38:28,613 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2019-03-14 11:38:28,674 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2019-03-14 11:38:28,691 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2019-03-14 11:38:28,694 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:38:28,698 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2019-03-14 11:38:28,700 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-03-14 11:38:28,818 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
2019-03-14 11:39:09,010 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://cluster:8080 was granted leadership with leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
2019-03-14 11:39:09,010 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
2019-03-14 11:39:09,011 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2019-03-14 11:39:09,012 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
2019-03-14 11:39:09,017 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.

Task Manager Logs:

2019-03-14 11:42:35,790 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill files.
2019-03-14 11:42:35,820 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages have a max timeout of 10000 ms
2019-03-14 11:42:35,839 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-03-14 11:42:35,853 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:42:35,854 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job leader service.
2019-03-14 11:42:35,855 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
2019-03-14 11:42:35,871 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
2019-03-14 11:42:35,963 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@cluster:31794] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service not known]
2019-03-14 11:42:35,964 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@cluster:31794/user/resourcemanager..
2019-03-14 11:47:35,895 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor            - Fatal error occurred in TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,897 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
2019-03-14 11:47:35,908 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager removed spill file directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
2019-03-14 11:47:35,908 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting down the network environment and its components.
2019-03-14 11:47:35,914 INFO  org.apache.flink.runtime.io.network.netty.NettyClient         - Successful shutdown (took 5 ms).
2019-03-14 11:47:35,917 INFO  org.apache.flink.runtime.io.network.netty.NettyServer         - Successful shutdown (took 2 ms).
2019-03-14 11:47:35,925 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Stop job leader service.
2019-03-14 11:47:35,931 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopped TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,931 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache
2019-03-14 11:47:35,933 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
2019-03-14 11:47:35,943 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-03-14 11:47:35,950 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x26977a24c4e0018 closed
2019-03-14 11:47:35,950 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x26977a24c4e0018
2019-03-14 11:47:35,950 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping Akka RPC service.
2019-03-14 11:47:35,952 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2019-03-14 11:47:35,952 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,959 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2019-03-14 11:47:35,966 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,983 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2019-03-14 11:47:35,984 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2019-03-14 11:47:35,992 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped Akka RPC service.


From: Gary Yao <ga...@ververica.com>
Date: Thursday, 14 March 2019 at 9:06 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: user <us...@flink.apache.org>
Subject: [External] Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Harshith,

Can you share JM and TM logs?

Best,
Gary

On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:
Hi all,

I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2

When I bring up the cluster, the task managers refuse to connect to the job managers with the following error.

        2019-03-14 10:34:41,551 WARN  akka.remote.ReliableDeliverySupervisor
        - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or service not known]

Now, this works correctly if I add the following line into the /etc/hosts file.

        x.x.x.x job-manager-address.com cluster

Why is Flink 1.7.2 connecting to JM using cluster in the address? Flink 1.4.2 used to have the job manager's address instead of the word cluster.

Thanks,
Harshith


Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Posted by "Kumar Bolar, Harshith" <hk...@arity.com>.
Hi Gary,

Here are the full job manager and task manager logs. In the job manager logs, I see it says “starting StandaloneSessionClusterEntrypoint”, whereas in Flink 1.4.2, it used to say “starting JobManager”. Is this correct?

Job manager logs: https://paste.ubuntu.com/p/DCVzsQdpHq/
Task Manager logs: https://paste.ubuntu.com/p/wbvYFZxdT8/

Thanks,
Harshith

From: Gary Yao <ga...@ververica.com>
Date: Thursday, 14 March 2019 at 10:11 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: user <us...@flink.apache.org>
Subject: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Harshith,

The truncated log is not enough. Can you share the complete logs? If that's
not possible, I'd like to see the beginning of the log files where the cluster
configuration is logged.

The TaskManager tries to connect to the leader that is advertised in
ZooKeeper. In your case the "cluster" hostname is advertised, which hints at a
problem in your Flink configuration.

Best,
Gary

On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:
Hi Gary,

I’ve attached the relevant portions of the JM and TM logs.

Job Manager Logs:

2019-03-14 11:38:28,257 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2019-03-14 11:38:28,309 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component log file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
2019-03-14 11:38:28,309 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component stdout file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
2019-03-14 11:38:28,527 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at cluster:8080
2019-03-14 11:38:28,527 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2019-03-14 11:38:28,574 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://cluster:8080.
2019-03-14 11:38:28,613 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2019-03-14 11:38:28,674 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2019-03-14 11:38:28,691 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2019-03-14 11:38:28,694 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:38:28,698 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2019-03-14 11:38:28,700 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-03-14 11:38:28,818 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
2019-03-14 11:39:09,010 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://cluster:8080 was granted leadership with leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
2019-03-14 11:39:09,010 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
2019-03-14 11:39:09,011 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2019-03-14 11:39:09,012 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
2019-03-14 11:39:09,017 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.

Task Manager Logs:

2019-03-14 11:42:35,790 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill files.
2019-03-14 11:42:35,820 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages have a max timeout of 10000 ms
2019-03-14 11:42:35,839 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-03-14 11:42:35,853 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:42:35,854 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job leader service.
2019-03-14 11:42:35,855 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
2019-03-14 11:42:35,871 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
2019-03-14 11:42:35,963 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@cluster:31794] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service not known]
2019-03-14 11:42:35,964 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@cluster:31794/user/resourcemanager..
2019-03-14 11:47:35,895 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor            - Fatal error occurred in TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,897 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
2019-03-14 11:47:35,908 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager removed spill file directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
2019-03-14 11:47:35,908 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting down the network environment and its components.
2019-03-14 11:47:35,914 INFO  org.apache.flink.runtime.io.network.netty.NettyClient         - Successful shutdown (took 5 ms).
2019-03-14 11:47:35,917 INFO  org.apache.flink.runtime.io.network.netty.NettyServer         - Successful shutdown (took 2 ms).
2019-03-14 11:47:35,925 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Stop job leader service.
2019-03-14 11:47:35,931 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopped TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,931 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache
2019-03-14 11:47:35,933 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
2019-03-14 11:47:35,943 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-03-14 11:47:35,950 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x26977a24c4e0018 closed
2019-03-14 11:47:35,950 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x26977a24c4e0018
2019-03-14 11:47:35,950 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping Akka RPC service.
2019-03-14 11:47:35,952 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2019-03-14 11:47:35,952 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,959 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2019-03-14 11:47:35,966 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,983 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2019-03-14 11:47:35,984 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2019-03-14 11:47:35,992 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped Akka RPC service.



Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Posted by Gary Yao <ga...@ververica.com>.
Hi Harshith,

The truncated log is not enough. Can you share the complete logs? If that's
not possible, I'd like to see the beginning of the log files where the cluster
configuration is logged.

The TaskManager tries to connect to the leader that is advertised in
ZooKeeper. In your case the "cluster" hostname is advertised, which hints at a
problem in your Flink configuration.
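
A minimal Python sketch of this diagnosis, assuming only the log lines quoted
in this thread: it pulls the advertised host and port out of an akka.tcp
address in a log line and checks whether that host name resolves on the local
machine. The helper names advertised_endpoint and resolves are hypothetical
and are not part of Flink.

```python
import re
import socket

# Matches addresses such as akka.tcp://flink@cluster:31794/user/resourcemanager
# (the address format is taken from the log lines in this thread).
AKKA_ADDRESS = re.compile(
    r"akka\.tcp://[^@\s]+@(?P<host>[^:/\s\]]+):(?P<port>\d+)"
)


def advertised_endpoint(log_line):
    """Return (host, port) of the first akka.tcp address in a log line, or None."""
    match = AKKA_ADDRESS.search(log_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def resolves(host):
    """Return True if the local resolver (DNS or /etc/hosts) knows this name."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Running advertised_endpoint over the TaskManager's "Connecting to
ResourceManager" line yields the host that was advertised; if resolves returns
False for it on the TaskManager machine, the TaskManager cannot reach the
leader under that name.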

Best,
Gary

On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hk...@arity.com>
wrote:

> Hi Gary,
>
>
>
> I’ve attached the relevant portions of the JM and TM logs.
>
>
>
> *Job Manager Logs:*
>
> 2019-03-14 11:38:28,257 INFO
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
> - State change: CONNECTED
> 2019-03-14 11:38:28,309 INFO
> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
> location of main cluster component log file:
> /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
> 2019-03-14 11:38:28,309 INFO
> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
> location of main cluster component stdout file:
> /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
> 2019-03-14 11:38:28,527 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest
> endpoint listening at cluster:8080
> 2019-03-14 11:38:28,527 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
> 2019-03-14 11:38:28,574 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web
> frontend listening at http://cluster:8080.
> 2019-03-14 11:38:28,613 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
> akka://flink/user/resourcemanager .
> 2019-03-14 11:38:28,674 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> at akka://flink/user/dispatcher .
> 2019-03-14 11:38:28,691 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
> 2019-03-14 11:38:28,694 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:38:28,698 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
> 2019-03-14 11:38:28,700 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 2019-03-14 11:38:28,818 WARN
> akka.remote.ReliableDeliverySupervisor                        - Association
> with remote system [akka.tcp://flink@cluster:22671] has failed, address
> is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
> 2019-03-14 11:39:09,010 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    -
> http://cluster:8080 was granted leadership with
> leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
> 2019-03-14 11:39:09,010 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
> ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was
> granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
> 2019-03-14 11:39:09,011 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
> Starting the SlotManager.
> 2019-03-14 11:39:09,012 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher
> akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership
> with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
> 2019-03-14 11:39:09,017 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering
> all persisted jobs.
>
> *Task Manager Logs:*
>
> 2019-03-14 11:42:35,790 INFO
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
> uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill
> files.
> 2019-03-14 11:42:35,820 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages
> have a max timeout of 10000 ms
> 2019-03-14 11:42:35,839 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at
> akka://flink/user/taskmanager_0 .
> 2019-03-14 11:42:35,853 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:42:35,854 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job
> leader service.
> 2019-03-14 11:42:35,855 INFO
> org.apache.flink.runtime.filecache.FileCache                  - User file
> cache uses directory
> /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
> 2019-03-14 11:42:35,871 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting
> to ResourceManager
> akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
> 2019-03-14 11:42:35,963 WARN
> akka.remote.ReliableDeliverySupervisor                        - Association
> with remote system [akka.tcp://flink@cluster:31794] has failed, address
> is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service
> not known]
> 2019-03-14 11:42:35,964 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not
> resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager,
> retrying in 10000 ms: Could not connect to rpc endpoint under address
> akka.tcp://flink@cluster:31794/user/resourcemanager..
> 2019-03-14 11:47:35,895 ERROR
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Fatal error
> occurred in TaskExecutor
> akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration 300000 ms. This indicates a problem with this
> instance. Terminating now.
>    at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>    at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>    at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>    at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>    at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>    at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>    at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>    at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>    at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,897 ERROR
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error
> occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration 300000 ms. This indicates a problem with this
> instance. Terminating now.
>    at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>    at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>    at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>    at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>    at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>    at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>    at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>    at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>    at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping
> TaskExecutor akka.tcp://
> flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
> Shutting down TaskExecutorLocalStateStoresManager.
> 2019-03-14 11:47:35,908 INFO
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
> removed spill file directory
> /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
> 2019-03-14 11:47:35,908 INFO
> org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting
> down the network environment and its components.
> 2019-03-14 11:47:35,914 INFO
> org.apache.flink.runtime.io.network.netty.NettyClient         - Successful
> shutdown (took 5 ms).
> 2019-03-14 11:47:35,917 INFO
> org.apache.flink.runtime.io.network.netty.NettyServer         - Successful
> shutdown (took 2 ms).
> 2019-03-14 11:47:35,925 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Stop job
> leader service.
> 2019-03-14 11:47:35,931 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopped
> TaskExecutor akka.tcp://
> flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,931 INFO
> org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting
> down BLOB cache
> 2019-03-14 11:47:35,933 INFO
> org.apache.flink.runtime.blob.TransientBlobCache              - Shutting
> down BLOB cache
> 2019-03-14 11:47:35,943 INFO
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
> - backgroundOperationsLoop exiting
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
> Session: 0x26977a24c4e0018 closed
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
> EventThread shut down for session: 0x26977a24c4e0018
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping
> Akka RPC service.
> 2019-03-14 11:47:35,952 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting
> down remote daemon.
> 2019-03-14 11:47:35,952 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote
> daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,959 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting
> down remote daemon.
> 2019-03-14 11:47:35,966 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote
> daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,983 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting
> shut down.
> 2019-03-14 11:47:35,984 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting
> shut down.
> 2019-03-14 11:47:35,992 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped
> Akka RPC service.
>
>
>
>
>
> *From: *Gary Yao <ga...@ververica.com>
> *Date: *Thursday, 14 March 2019 at 9:06 PM
> *To: *Harshith Kumar Bolar <hk...@arity.com>
> *Cc: *user <us...@flink.apache.org>
> *Subject: *[External] Re: Flink 1.7.2: Task Manager not able to connect
> to Job Manager
>
>
>
> Hi Harshith,
>
>
>
> Can you share JM and TM logs?
>
>
>
> Best,
>
> Gary
>
>
>
> On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com>
> wrote:
>
> Hi all,
>
>
>
> I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2
>
>
>
> When I bring up the cluster, the task managers refuse to connect to the
> job managers with the following error.
>
>
>
>         2019-03-14 10:34:41,551 WARN
> akka.remote.ReliableDeliverySupervisor
>
>         - Association with remote system [akka.tcp://flink@cluster:22671]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or
> service not known]
>
>
>
> Now, this works correctly if I add the following line into
> the /etc/hosts file.
>
>
>
>         x.x.x.x job-manager-address.com cluster
>
>
>
> Why is Flink 1.7.2 connecting to JM using cluster in the address? Flink
> 1.4.2 used to have the job manager's address instead of the word cluster.
>
>
>
> Thanks,
>
> Harshith
>
>
>
>

Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Posted by "Kumar Bolar, Harshith" <hk...@arity.com>.
Hi Gary,

I’ve attached the relevant portions of the JM and TM logs.

Job Manager Logs:

2019-03-14 11:38:28,257 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: CONNECTED
2019-03-14 11:38:28,309 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component log file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
2019-03-14 11:38:28,309 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of main cluster component stdout file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
2019-03-14 11:38:28,527 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at cluster:8080
2019-03-14 11:38:28,527 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
2019-03-14 11:38:28,574 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://cluster:8080.
2019-03-14 11:38:28,613 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2019-03-14 11:38:28,674 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2019-03-14 11:38:28,691 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2019-03-14 11:38:28,694 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:38:28,698 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
2019-03-14 11:38:28,700 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-03-14 11:38:28,818 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
2019-03-14 11:39:09,010 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://cluster:8080 was granted leadership with leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
2019-03-14 11:39:09,010 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
2019-03-14 11:39:09,011 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2019-03-14 11:39:09,012 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
2019-03-14 11:39:09,017 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.

Task Manager Logs:

2019-03-14 11:42:35,790 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill files.
2019-03-14 11:42:35,820 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages have a max timeout of 10000 ms
2019-03-14 11:42:35,839 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-03-14 11:42:35,853 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:42:35,854 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job leader service.
2019-03-14 11:42:35,855 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
2019-03-14 11:42:35,871 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
2019-03-14 11:42:35,963 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@cluster:31794] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service not known]
2019-03-14 11:42:35,964 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@cluster:31794/user/resourcemanager..
2019-03-14 11:47:35,895 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor            - Fatal error occurred in TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,897 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
   at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
   at akka.actor.ActorCell.invoke(ActorCell.scala:495)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-03-14 11:47:35,904 INFO  org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.
2019-03-14 11:47:35,908 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager removed spill file directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
2019-03-14 11:47:35,908 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting down the network environment and its components.
2019-03-14 11:47:35,914 INFO  org.apache.flink.runtime.io.network.netty.NettyClient         - Successful shutdown (took 5 ms).
2019-03-14 11:47:35,917 INFO  org.apache.flink.runtime.io.network.netty.NettyServer         - Successful shutdown (took 2 ms).
2019-03-14 11:47:35,925 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Stop job leader service.
2019-03-14 11:47:35,931 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopped TaskExecutor akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
2019-03-14 11:47:35,931 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache
2019-03-14 11:47:35,933 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache
2019-03-14 11:47:35,943 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-03-14 11:47:35,950 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x26977a24c4e0018 closed
2019-03-14 11:47:35,950 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x26977a24c4e0018
2019-03-14 11:47:35,950 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping Akka RPC service.
2019-03-14 11:47:35,952 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2019-03-14 11:47:35,952 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,959 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting down remote daemon.
2019-03-14 11:47:35,966 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote daemon shut down; proceeding with flushing remote transports.
2019-03-14 11:47:35,983 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2019-03-14 11:47:35,984 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting shut down.
2019-03-14 11:47:35,992 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped Akka RPC service.
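
A note for anyone reading these logs later: the address the ResourceManager advertises (akka.tcp://flink@cluster:31794) uses the host name "cluster", and the TM fails with "Name or service not known" when it tries to resolve it. The sketch below is a hypothetical helper, not part of Flink, and only illustrates one way to check this. It pulls the host names out of akka.tcp addresses in log lines and asks the local resolver (DNS or /etc/hosts) about each one. The function names and the regular expression are assumptions made for this example.

```python
import re
import socket

# Matches addresses such as akka.tcp://flink@cluster:31794 and captures
# the host and port. The "flink" actor system name is taken from the logs above.
AKKA_ADDRESS = re.compile(r"akka\.tcp://flink@([^:/\s\]]+):(\d+)")


def extract_akka_hosts(log_lines):
    """Return the set of host names found in akka.tcp addresses."""
    hosts = set()
    for line in log_lines:
        for host, _port in AKKA_ADDRESS.findall(line):
            hosts.add(host)
    return hosts


def resolves(host):
    """Return True if the local resolver (DNS or /etc/hosts) knows the host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Run on the task manager host, calling extract_akka_hosts() on the lines of the TM log and then resolves() on each result should show whether a name like "cluster" is the one that cannot be resolved.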


From: Gary Yao <ga...@ververica.com>
Date: Thursday, 14 March 2019 at 9:06 PM
To: Harshith Kumar Bolar <hk...@arity.com>
Cc: user <us...@flink.apache.org>
Subject: [External] Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Hi Harshith,

Can you share JM and TM logs?

Best,
Gary

On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:
Hi all,

I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2

When I bring up the cluster, the task managers refuse to connect to the job managers with the following error.

        2019-03-14 10:34:41,551 WARN  akka.remote.ReliableDeliverySupervisor
        - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or service not known]

Now, this works correctly if I add the following line into the /etc/hosts file.

        x.x.x.x job-manager-address.com cluster

Why is Flink 1.7.2 connecting to JM using cluster in the address? Flink 1.4.2 used to have the job manager's address instead of the word cluster.

Thanks,
Harshith


Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

Posted by Gary Yao <ga...@ververica.com>.
Hi Harshith,

Can you share JM and TM logs?

Best,
Gary

On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com>
wrote:

> Hi all,
>
>
>
> I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2
>
>
>
> When I bring up the cluster, the task managers refuse to connect to the
> job managers with the following error.
>
>
>
>         2019-03-14 10:34:41,551 WARN
> akka.remote.ReliableDeliverySupervisor
>
>         - Association with remote system [akka.tcp://flink@cluster:22671]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or
> service not known]
>
>
>
> Now, this works correctly if I add the following line into
> the /etc/hosts file.
>
>
>
>         x.x.x.x job-manager-address.com cluster
>
>
>
> Why is Flink 1.7.2 connecting to JM using cluster in the address? Flink
> 1.4.2 used to have the job manager's address instead of the word cluster.
>
>
>
> Thanks,
>
> Harshith
>
>
>