Posted to user@flink.apache.org by Jeroen Steggink | knowsy <je...@knowsy.nl> on 2018/10/26 11:40:48 UTC

HA jobmanagers redirect to ip address of leader instead of hostname

Hi,

I'm having some trouble with Flink jobmanagers in an HA setup within
OpenShift.

I have three jobmanagers, a ZooKeeper cluster, and a load balancer
(OpenShift/Kubernetes Route) for the web UI / REST server on the
jobmanagers. Everything works fine as long as the load balancer connects
to the leader. However, when the leader changes and the load balancer
connects to a non-leader, that jobmanager redirects to the leader using the
IP address of the host. Since routing in our network is done by
hostname, the redirect target cannot be reached by its IP address and the
request times out.

So I have a few questions:
1. Why is Flink using the IP address instead of the hostnames that are
configured in the config? Other times it does use the hostname, for
example in the info sent to ZooKeeper.
2. Is there another way of coping with connections to non-leaders
instead of redirects? Maybe proxying through a non-leader to the leader?

Cheers,
Jeroen
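
As an illustration of the kind of workaround question 2 hints at, here is a minimal sketch of a proxy-side helper that rewrites the host in a redirect's Location header from an IP address to a routable hostname. This is not Flink's behaviour or API. The function name `rewrite_redirect`, the IP-to-hostname mapping, and the example addresses are all assumptions made up for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_redirect(location: str, ip_to_host: dict) -> str:
    """Return `location` with its host swapped for a known hostname.

    `ip_to_host` is a hypothetical mapping maintained by the operator,
    e.g. {"10.0.0.5": "jobmanager-1.example.internal"}.
    Relative URLs and unknown hosts are returned unchanged.
    Any user:password part of the URL is dropped (not expected here).
    """
    parts = urlsplit(location)
    host = parts.hostname
    if host is None or host not in ip_to_host:
        return location
    new_host = ip_to_host[host]
    # Keep the original port, if one was given.
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

Under these assumptions, a proxy in front of the jobmanagers could apply the helper to every 3xx response before passing it on to the client, so the client only ever sees hostnames it can route to.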

Re: HA jobmanagers redirect to ip address of leader instead of hostname

Posted by Till Rohrmann <tr...@apache.org>.

Thanks for opening this issue, Jeroen. I think we need to look into this.
Could you share the Flink logs with the community? Ideally, attach
them to the JIRA issue.

Cheers,
Till

On Thu, Nov 8, 2018 at 10:51 PM Jeroen Steggink | knowsy <je...@knowsy.nl>
wrote:

> Hi Till,
>
> Thanks for your reply. We are running version 1.5.4. We can't upgrade to
> 1.6.x because we are using Apache Beam which doesn't support 1.6.x yet.
>
> I have also made a Jira issue about this:
> https://issues.apache.org/jira/projects/FLINK/issues/FLINK-10748
> Best regards,
> Jeroen Steggink

Re: HA jobmanagers redirect to ip address of leader instead of hostname

Posted by Jeroen Steggink | knowsy <je...@knowsy.nl>.

Hi Till,

Thanks for your reply. We are running version 1.5.4. We can't upgrade to 
1.6.x because we are using Apache Beam which doesn't support 1.6.x yet.

I have also made a Jira issue about this: 
https://issues.apache.org/jira/projects/FLINK/issues/FLINK-10748

Best regards,
Jeroen Steggink

> On 07-Nov-18 16:06, Till Rohrmann wrote:
>> Hi Jeroen,
>>
>> this sounds like a bug in Flink: we sometimes return IP addresses
>> instead of hostnames. Could you tell me which Flink version you are
>> using? In the current version, the redirect address and the address
>> retrieved from ZooKeeper should actually be the same.
>>
>> In the future, we plan to remove the redirect message and simply
>> forward the request to the current leader. This should hopefully
>> avoid these kinds of problems.
>>
>> Cheers,
>> Till

Re: HA jobmanagers redirect to ip address of leader instead of hostname

Posted by Till Rohrmann <tr...@apache.org>.

Hi Jeroen,

this sounds like a bug in Flink: we sometimes return IP addresses
instead of hostnames. Could you tell me which Flink version you are using?
In the current version, the redirect address and the address retrieved from
ZooKeeper should actually be the same.

In the future, we plan to remove the redirect message and simply forward
the request to the current leader. This should hopefully avoid these kinds
of problems.

Cheers,
Till
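
To check whether a given redirect shows the symptom described above, one could inspect the Location header of the response from a non-leader and test whether its host is an IP literal rather than a hostname. The following is only a sketch of such a check. The function name `redirect_uses_ip` and the example URLs are assumptions, and nothing here calls into Flink itself.

```python
import ipaddress
from urllib.parse import urlsplit


def redirect_uses_ip(location: str) -> bool:
    """Return True if the host in `location` is an IPv4/IPv6 literal."""
    host = urlsplit(location).hostname
    if host is None:
        # Relative redirect: no host to inspect.
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so treat it as a hostname.
        return False
    return True


# Hypothetical Location values from a non-leader jobmanager:
print(redirect_uses_ip("http://10.0.0.5:8081/overview"))
print(redirect_uses_ip("http://jobmanager-1:8081/overview"))
```

Comparing the result with the leader address published in ZooKeeper would show whether the two really differ in a given setup.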
