Posted to user@flink.apache.org by Jason Kania <ja...@ymail.com> on 2018/05/15 03:15:52 UTC

Leader Retrieval Timeout with HA Job Manager

Hi,

I am using the 1.4.2 release on Ubuntu and attempting to make use of an HA Job Manager, but unfortunately enabling HA prevents job submission with the following error:

java.lang.RuntimeException: Failed to retrieve JobManager address
        at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:308)
        at org.apache.flink.client.program.StandaloneClusterClient.getClusterIdentifier(StandaloneClusterClient.java:86)
        at org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:921)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader address and leader session ID.
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
        at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:302)
        ... 8 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:190)
        at scala.concurrent.Await.result(package.scala)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
        ... 9 more

This also seems to be tied to problems with TaskManager registration. I have to repeatedly restart the TaskManager until it finally connects to the Job Manager. Most of the time it does not connect and does not complain, which makes determining the root cause more difficult. The cluster is not busy, and I have tried both IP addresses and host names to rule out name resolution issues, but the behavior is the same in both cases.

I have also noticed that if two job managers are launched on different nodes in the same cluster, they both log that they are the leader, so they are not coordinating with each other effectively, and the logging does not even indicate that they are attempting to talk to one another.

Lastly, the error "Could not retrieve the leader address and leader session ID." is a very poor error message because it does not say where it is attempting to retrieve the information from.

Any suggestions would be appreciated.

Thanks,

Jason

Re: Leader Retrieval Timeout with HA Job Manager

Posted by Till Rohrmann <tr...@apache.org>.
Hi Jason,

sorry for the late response. The logs do indeed look strange, because both JMs
are granted leadership without the other having its leadership revoked.
What would be interesting is to take a look at the Znode under
`/flink/flink-ha/leader/00000000000000000000000000000000/job_manager_lock`
in your ZooKeeper cluster. You should be able to take a look at the Znodes
with `bin/zkCli.sh -server <SERVER_ADDRESS>`. One of the TMs retrieves the
leader address and the other is not notified about the leader change. It is
as if the ZooKeeper cluster does not notify Flink about the changes.
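
For example, something along these lines should show what is currently stored
under the leader path (the server address below is just a placeholder for one
of your ZooKeeper nodes):

    # connect to one of the ZooKeeper servers
    bin/zkCli.sh -server zk-host-1:2181

    # inside the zkCli shell, list the leader directory and read the lock node
    ls /flink/flink-ha/leader/00000000000000000000000000000000
    get /flink/flink-ha/leader/00000000000000000000000000000000/job_manager_lock

If the address stored there does not match the JobManager that the second TM
is trying to reach, that would point to a problem on the ZooKeeper side.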

Cheers,
Till

On Tue, May 15, 2018 at 5:33 PM, Jason Kania <ja...@ymail.com> wrote:

> Thanks for your help. The job manager launch logs from two nodes of the
> cluster are provided, as well as the logs for the task managers, one that
> worked and one that could not seem to find an address, which I am assuming
> is the job manager's. The logs are from nodes aaa-1 and aaa-2.
>
> Thanks,
>
> Jason
>
> On Tuesday, May 15, 2018, 9:59:58 a.m. EDT, Timo Walther <
> twalthr@apache.org> wrote:
>
>
> Can you change the log level to DEBUG and share the logs with us? Maybe
> Till (in CC) has some idea?
>
> Regards,
> Timo
>
>
> Am 15.05.18 um 15:18 schrieb Jason Kania:
>
> Hi Timo,
>
> Thanks for the response.
>
> Yes, we are running with a cloud provider, a cloud system provided by our
> national government for R&D purposes. The thing is that we also have Kafka
> and Cassandra on the same nodes and they have no issues in this
> environment; it is just Flink in an HA configuration that has problems,
> which is strange.
>
> Is there any additional logging available for analysis of these sorts of
> scenarios? The details in the current logs are insufficient to know what is
> happening.
>
> Thanks,
>
> Jason
>
> On Tuesday, May 15, 2018, 7:51:40 a.m. EDT, Timo Walther
> <tw...@apache.org> <tw...@apache.org> wrote:
>
>
> Hi Jason,
>
> this sounds more like a network connection/firewall issue to me. Can you
> tell us a bit more about your environment? Are you running your Flink
> cluster on a cloud provider?
>
> Regards,
> Timo
>
>
> Am 15.05.18 um 05:15 schrieb Jason Kania:
>
> Hi,
>
> I am using the 1.4.2 release on ubuntu and attempting to make use of an HA
> Job Manager, but unfortunately using HA functionality prevents job
> submission with the following error:
>
> java.lang.RuntimeException: Failed to retrieve JobManager address
>         at org.apache.flink.client.program.ClusterClient.
> getJobManagerAddress(ClusterClient.java:308)
>         at org.apache.flink.client.program.StandaloneClusterClient.
> getClusterIdentifier(StandaloneClusterClient.java:86)
>         at org.apache.flink.client.CliFrontend.createClient(
> CliFrontend.java:921)
>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
>         at org.apache.flink.client.CliFrontend.parseParameters(
> CliFrontend.java:1054)
>         at org.apache.flink.client.CliFrontend$1.call(
> CliFrontend.java:1101)
>         at org.apache.flink.client.CliFrontend$1.call(
> CliFrontend.java:1098)
>         at org.apache.flink.runtime.security.NoOpSecurityContext.
> runSecured(NoOpSecurityContext.java:30)
>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
> Could not retrieve the leader address and leader session ID.
>         at org.apache.flink.runtime.util.LeaderRetrievalUtils.
> retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
>         at org.apache.flink.client.program.ClusterClient.
> getJobManagerAddress(ClusterClient.java:302)
>         ... 8 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [60000 milliseconds]
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.
> scala:223)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.
> scala:227)
>         at scala.concurrent.Await$$anonfun$result$1.apply(
> package.scala:190)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(
> BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:190)
>         at scala.concurrent.Await.result(package.scala)
>         at org.apache.flink.runtime.util.LeaderRetrievalUtils.
> retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
>         ... 9 more
>
> This also seems to be tied to problems with TaskManager registration. I
> have to repeatedly restart the TaskManager until it finally connects to the
> Job Manager. Most of the time it does not connect and does not complain,
> which makes determining the root cause more difficult. The cluster is not
> busy, and I have tried both IP addresses and host names to rule out name
> resolution issues, but the behavior is the same in both cases.
>
> I have also noticed that if two job managers are launched on different
> nodes in the same cluster, they both log that they are the leader, so they
> are not coordinating with each other effectively, and the logging does not
> even indicate that they are attempting to talk to one another.
>
> Lastly, the error "Could not retrieve the leader address and leader session
> ID." is a very poor error message because it does not say where it is
> attempting to retrieve the information from.
>
> Any suggestions would be appreciated.
>
> Thanks,
>
> Jason
>
>
>
>

Re: Leader Retrieval Timeout with HA Job Manager

Posted by Jason Kania <ja...@ymail.com>.
Thanks for your help. The job manager launch logs from two nodes of the cluster are provided, as well as the logs for the task managers, one that worked and one that could not seem to find an address, which I am assuming is the job manager's. The logs are from nodes aaa-1 and aaa-2.

Thanks,

Jason

    On Tuesday, May 15, 2018, 9:59:58 a.m. EDT, Timo Walther <tw...@apache.org> wrote:  
 
  Can you change the log level to DEBUG and share the logs with us? Maybe Till (in CC) has some idea?
 
 Regards,
 Timo
 
 
 Am 15.05.18 um 15:18 schrieb Jason Kania:
  
  Hi Timo,
  
 Thanks for the response.
 
Yes, we are running with a cloud provider, a cloud system provided by our national government for R&D purposes. The thing is that we also have Kafka and Cassandra on the same nodes and they have no issues in this environment; it is just Flink in an HA configuration that has problems, which is strange.
 
 Is there any additional logging available for analysis of these sorts of scenarios? The details in the current logs are insufficient to know what is happening.
 
 Thanks,
 
 Jason
         
     On Tuesday, May 15, 2018, 7:51:40 a.m. EDT, Timo Walther <tw...@apache.org> wrote:  
  
     Hi Jason,
 
 this sounds more like a network connection/firewall issue to me. Can you tell us a bit more about your environment? Are you running your Flink cluster on a cloud provider?
 
 Regards,
 Timo
 
 
 Am 15.05.18 um 05:15 schrieb Jason Kania:
   
  Hi,
 
 I am using the 1.4.2 release on ubuntu and attempting to make use of an HA Job Manager, but unfortunately using HA  functionality prevents job submission with the following error:
 
java.lang.RuntimeException: Failed to retrieve JobManager address
        at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:308)
        at org.apache.flink.client.program.StandaloneClusterClient.getClusterIdentifier(StandaloneClusterClient.java:86)
        at org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:921)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader address and leader session ID.
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
        at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:302)
        ... 8 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:190)
        at scala.concurrent.Await.result(package.scala)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
        ... 9 more
 
This also seems to be tied to problems with TaskManager registration. I have to repeatedly restart the TaskManager until it finally connects to the Job Manager. Most of the time it does not connect and does not complain, which makes determining the root cause more difficult. The cluster is not busy, and I have tried both IP addresses and host names to rule out name resolution issues, but the behavior is the same in both cases.

I have also noticed that if two job managers are launched on different nodes in the same cluster, they both log that they are the leader, so they are not coordinating with each other effectively, and the logging does not even indicate that they are attempting to talk to one another.

Lastly, the error "Could not retrieve the leader address and leader session ID." is a very poor error message because it does not say where it is attempting to retrieve the information from.
  
 Any suggestions would be appreciated.
 
   Thanks,
 
Jason

Re: Leader Retrieval Timeout with HA Job Manager

Posted by Till Rohrmann <tr...@apache.org>.
Hi Jason,

the client logs would indeed be very interesting to further debug this
problem. What you have to make sure is that the client has the same HA
configuration settings as the cluster because the client needs to talk to
your ZooKeeper quorum in order to retrieve the leader address.
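
As a rough sketch, the HA-related entries in the flink-conf.yaml used by the
client should mirror those of the cluster, e.g. something along these lines
(the hostnames and the cluster id below are just placeholders):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181,zk-host-3:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.cluster-id: flink-ha

If any of these values differ between the client and the cluster, the client
looks for the leader in a different place and runs into exactly this kind of
timeout.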

When running 2 JobManagers, only one of them should be the leader.
Everything else sounds like either a serious bug or a setup problem. Could
you maybe share the logs of these processes as well?

Cheers,
Till

On Tue, May 15, 2018 at 3:59 PM, Timo Walther <tw...@apache.org> wrote:

> Can you change the log level to DEBUG and share the logs with us? Maybe
> Till (in CC) has some idea?
>
> Regards,
> Timo
>
>
> Am 15.05.18 um 15:18 schrieb Jason Kania:
>
> Hi Timo,
>
> Thanks for the response.
>
> Yes, we are running with a cloud provider, a cloud system provided by our
> national government for R&D purposes. The thing is that we also have Kafka
> and Cassandra on the same nodes and they have no issues in this
> environment; it is just Flink in an HA configuration that has problems,
> which is strange.
>
> Is there any additional logging available for analysis of these sorts of
> scenarios? The details in the current logs are insufficient to know what is
> happening.
>
> Thanks,
>
> Jason
>
> On Tuesday, May 15, 2018, 7:51:40 a.m. EDT, Timo Walther
> <tw...@apache.org> <tw...@apache.org> wrote:
>
>
> Hi Jason,
>
> this sounds more like a network connection/firewall issue to me. Can you
> tell us a bit more about your environment? Are you running your Flink
> cluster on a cloud provider?
>
> Regards,
> Timo
>
>
> Am 15.05.18 um 05:15 schrieb Jason Kania:
>
> Hi,
>
> I am using the 1.4.2 release on ubuntu and attempting to make use of an HA
> Job Manager, but unfortunately using HA functionality prevents job
> submission with the following error:
>
> java.lang.RuntimeException: Failed to retrieve JobManager address
>         at org.apache.flink.client.program.ClusterClient.
> getJobManagerAddress(ClusterClient.java:308)
>         at org.apache.flink.client.program.StandaloneClusterClient.
> getClusterIdentifier(StandaloneClusterClient.java:86)
>         at org.apache.flink.client.CliFrontend.createClient(
> CliFrontend.java:921)
>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
>         at org.apache.flink.client.CliFrontend.parseParameters(
> CliFrontend.java:1054)
>         at org.apache.flink.client.CliFrontend$1.call(
> CliFrontend.java:1101)
>         at org.apache.flink.client.CliFrontend$1.call(
> CliFrontend.java:1098)
>         at org.apache.flink.runtime.security.NoOpSecurityContext.
> runSecured(NoOpSecurityContext.java:30)
>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
> Could not retrieve the leader address and leader session ID.
>         at org.apache.flink.runtime.util.LeaderRetrievalUtils.
> retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
>         at org.apache.flink.client.program.ClusterClient.
> getJobManagerAddress(ClusterClient.java:302)
>         ... 8 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [60000 milliseconds]
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.
> scala:223)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.
> scala:227)
>         at scala.concurrent.Await$$anonfun$result$1.apply(
> package.scala:190)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(
> BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:190)
>         at scala.concurrent.Await.result(package.scala)
>         at org.apache.flink.runtime.util.LeaderRetrievalUtils.
> retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
>         ... 9 more
>
> This also seems to be tied to problems with TaskManager registration. I
> have to repeatedly restart the TaskManager until it finally connects to the
> Job Manager. Most of the time it does not connect and does not complain,
> which makes determining the root cause more difficult. The cluster is not
> busy, and I have tried both IP addresses and host names to rule out name
> resolution issues, but the behavior is the same in both cases.
>
> I have also noticed that if two job managers are launched on different
> nodes in the same cluster, they both log that they are the leader, so they
> are not coordinating with each other effectively, and the logging does not
> even indicate that they are attempting to talk to one another.
>
> Lastly, the error "Could not retrieve the leader address and leader session
> ID." is a very poor error message because it does not say where it is
> attempting to retrieve the information from.
>
> Any suggestions would be appreciated.
>
> Thanks,
>
> Jason
>
>
>
>

Re: Leader Retrieval Timeout with HA Job Manager

Posted by Timo Walther <tw...@apache.org>.
Can you change the log level to DEBUG and share the logs with us? Maybe 
Till (in CC) has some idea?
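
If you are using the default log4j setup that ships with the distribution,
changing the root logger in conf/log4j.properties on the JobManager and
TaskManager machines should be enough, e.g.:

    # conf/log4j.properties (assuming the default log4j-based logging setup)
    log4j.rootLogger=DEBUG, file

and then restarting the processes so that the change takes effect.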

Regards,
Timo


Am 15.05.18 um 15:18 schrieb Jason Kania:
> Hi Timo,
>
> Thanks for the response.
>
> Yes, we are running with a cloud provider, a cloud system provided by
> our national government for R&D purposes. The thing is that we also
> have Kafka and Cassandra on the same nodes and they have no issues in
> this environment; it is just Flink in an HA configuration that has
> problems, which is strange.
>
> Is there any additional logging available for analysis of these sorts 
> of scenarios? The details in the current logs are insufficient to know 
> what is happening.
>
> Thanks,
>
> Jason
>
> On Tuesday, May 15, 2018, 7:51:40 a.m. EDT, Timo Walther 
> <tw...@apache.org> wrote:
>
>
> Hi Jason,
>
> this sounds more like a network connection/firewall issue to me. Can 
> you tell us a bit more about your environment? Are you running your 
> Flink cluster on a cloud provider?
>
> Regards,
> Timo
>
>
> Am 15.05.18 um 05:15 schrieb Jason Kania:
>> Hi,
>>
>> I am using the 1.4.2 release on ubuntu and attempting to make use of 
>> an HA Job Manager, but unfortunately using HA functionality prevents 
>> job submission with the following error:
>>
>> java.lang.RuntimeException: Failed to retrieve JobManager address
>>         at 
>> org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:308)
>>         at 
>> org.apache.flink.client.program.StandaloneClusterClient.getClusterIdentifier(StandaloneClusterClient.java:86)
>>         at 
>> org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:921)
>>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
>>         at 
>> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>         at 
>> org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>         at 
>> org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>         at 
>> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>>         at 
>> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>> Caused by: 
>> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: 
>> Could not retrieve the leader address and leader session ID.
>>         at 
>> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
>>         at 
>> org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:302)
>>         ... 8 more
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out 
>> after [60000 milliseconds]
>>         at 
>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>>         at 
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>>         at 
>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>>         at 
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>         at scala.concurrent.Await$.result(package.scala:190)
>>         at scala.concurrent.Await.result(package.scala)
>>         at 
>> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
>>         ... 9 more
>>
>> This also seems to be tied to problems with TaskManager registration. I
>> have to repeatedly restart the TaskManager until it finally connects to
>> the Job Manager. Most of the time it does not connect and does not
>> complain, which makes determining the root cause more difficult. The
>> cluster is not busy, and I have tried both IP addresses and host names
>> to rule out name resolution issues, but the behavior is the same in
>> both cases.
>>
>> I have also noticed that if two job managers are launched on different
>> nodes in the same cluster, they both log that they are the leader, so
>> they are not coordinating with each other effectively, and the logging
>> does not even indicate that they are attempting to talk to one another.
>>
>> Lastly, the error "Could not retrieve the leader address and leader
>> session ID." is a very poor error message because it does not say where
>> it is attempting to retrieve the information from.
>>
>> Any suggestions would be appreciated.
>>
>> Thanks,
>>
>> Jason
>
>


Re: Leader Retrieval Timeout with HA Job Manager

Posted by Jason Kania <ja...@ymail.com>.
 Hi Timo,

Thanks for the response.

Yes, we are running with a cloud provider, a cloud system provided by our national government for R&D purposes. The thing is that we also have Kafka and Cassandra on the same nodes and they have no issues in this environment; it is just Flink in an HA configuration that has problems, which is strange.

Is there any additional logging available for analysis of these sorts of scenarios? The details in the current logs are insufficient to know what is happening.

Thanks,

Jason

    On Tuesday, May 15, 2018, 7:51:40 a.m. EDT, Timo Walther <tw...@apache.org> wrote:  
 
  Hi Jason,
 
 this sounds more like a network connection/firewall issue to me. Can you tell us a bit more about your environment? Are you running your Flink cluster on a cloud provider?
 
 Regards,
 Timo
 
 
 Am 15.05.18 um 05:15 schrieb Jason Kania:
  
  Hi,
 
 I am using the 1.4.2 release on ubuntu and attempting to make use of an HA Job Manager, but unfortunately using HA functionality prevents job submission with the following error:
 
java.lang.RuntimeException: Failed to retrieve JobManager address
        at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:308)
        at org.apache.flink.client.program.StandaloneClusterClient.getClusterIdentifier(StandaloneClusterClient.java:86)
        at org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:921)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader address and leader session ID.
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
        at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:302)
        ... 8 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:190)
        at scala.concurrent.Await.result(package.scala)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
        ... 9 more
 
This also seems to be tied to problems with TaskManager registration. I have to repeatedly restart the TaskManager until it finally connects to the Job Manager. Most of the time it does not connect and does not complain, which makes determining the root cause more difficult. The cluster is not busy, and I have tried both IP addresses and host names to rule out name resolution issues, but the behavior is the same in both cases.

I have also noticed that if two job managers are launched on different nodes in the same cluster, they both log that they are the leader, so they are not coordinating with each other effectively, and the logging does not even indicate that they are attempting to talk to one another.

Lastly, the error "Could not retrieve the leader address and leader session ID." is a very poor error message because it does not say where it is attempting to retrieve the information from.
  
 Any suggestions would be appreciated.
 
   Thanks,
 
Jason

Re: Leader Retrieval Timeout with HA Job Manager

Posted by Timo Walther <tw...@apache.org>.
Hi Jason,

this sounds more like a network connection/firewall issue to me. Can you 
tell us a bit more about your environment? Are you running your Flink 
cluster on a cloud provider?

Regards,
Timo


Am 15.05.18 um 05:15 schrieb Jason Kania:
> Hi,
>
> I am using the 1.4.2 release on ubuntu and attempting to make use of 
> an HA Job Manager, but unfortunately using HA functionality prevents 
> job submission with the following error:
>
> java.lang.RuntimeException: Failed to retrieve JobManager address
>         at 
> org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:308)
>         at 
> org.apache.flink.client.program.StandaloneClusterClient.getClusterIdentifier(StandaloneClusterClient.java:86)
>         at 
> org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:921)
>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
>         at 
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>         at 
> org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>         at 
> org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>         at 
> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
> Caused by: 
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: 
> Could not retrieve the leader address and leader session ID.
>         at 
> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
>         at 
> org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:302)
>         ... 8 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out 
> after [60000 milliseconds]
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>         at 
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>         at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:190)
>         at scala.concurrent.Await.result(package.scala)
>         at 
> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
>         ... 9 more
>
> This also seems to be tied to problems with TaskManager registration. I
> have to repeatedly restart the TaskManager until it finally connects to the
> Job Manager. Most of the time it does not connect and does not complain,
> which makes determining the root cause more difficult. The cluster is not
> busy, and I have tried both IP addresses and host names to rule out name
> resolution issues, but the behavior is the same in both cases.
>
> I have also noticed that if two job managers are launched on different
> nodes in the same cluster, they both log that they are the leader, so they
> are not coordinating with each other effectively, and the logging does not
> even indicate that they are attempting to talk to one another.
>
> Lastly, the error "Could not retrieve the leader address and leader session
> ID." is a very poor error message because it does not say where it is
> attempting to retrieve the information from.
>
> Any suggestions would be appreciated.
>
> Thanks,
>
> Jason