Posted to user@flink.apache.org by Fabian Hueske <fh...@gmail.com> on 2018/05/07 12:38:25 UTC

Re: strange behavior with jobmanager.rpc.address on standalone HA cluster

Hi Derek,

1. I've created a JIRA issue to improve the docs as you recommended [1].

2. This discussion goes quite a bit into the internals of the HA setup. Let
me pull in Till (in CC) who knows the details of HA.

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9309

2018-05-05 15:34 GMT+02:00 Derek VerLee <de...@gmail.com>:

> Two things:
>
> 1. I think it would be beneficial to drop a line somewhere in the docs
> (probably on the production readiness checklist as well as the HA page)
> explaining that enabling ZooKeeper high availability allows your jobs to
> restart automatically after a jobmanager crash or restart.  We had spent
> some cycles trying to implement job restarting and watchdogs (poorly)
> before I discovered this from a Flink Forward presentation on YouTube.
>
> 2. I seem to have found some odd behavior with HA and then found something
> that works, but I can't explain why.  The short version is that I took an
> existing standalone cluster with a single JM and switched it to ZooKeeper
> high availability mode.  The same flink-conf.yaml file is used on all
> nodes (including the JM). This seemed to work fine: I restarted the JM
> (jm0) and the jobs relaunched when it came back.  Easy!  Then I deployed a
> second JM (jm1).  I modified `masters`, set the HA rpc port range, and
> opened those ports on the firewall for both jobmanagers, but left
> `jobmanager.rpc.address` at its original value, `jm0`, on all nodes.  I
> then observed that jm0 worked fine; taskmanagers connected to it and jobs
> ran.  jm1 did not redirect (301) me to jm0, however; it displayed an empty
> dashboard (no jobs, no TMs).  When I stopped jm0, the jobs showed up on
> jm1 as RESTARTING, but the taskmanagers never attached to jm1.  In the
> logs, all nodes, including jm1, had messages about trying to reach jm0.
> From the documentation and various comments I've seen,
> `jobmanager.rpc.address` should be ignored.  However, commenting it out
> entirely led to the jobmanagers crashing at boot, and setting it to
> `localhost` caused all the taskmanagers to log messages about trying to
> connect to the jobmanager at localhost.  What finally worked was to set
> the value on each node to that node's own hostname, even on the
> taskmanagers (see the configuration sketch below).
>
> Does this seem like a bug?
>
> Just a hunch, but is there something called an "akka leader" that is
> distinct from the jobmanager leader, and could it somehow be defaulting
> to the value of jobmanager.rpc.address?
>
>
>
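
For reference, here is a minimal sketch of the kind of standalone ZooKeeper
HA configuration described above. The hostnames (jm0, jm1, zk1-zk3), ports,
and storage directory are placeholders, and the final line reflects the
per-node workaround described in the message rather than anything the
documentation requires:

    # conf/masters -- one jobmanager per line (host:webui-port)
    jm0:8081
    jm1:8081

    # conf/flink-conf.yaml -- shared by all nodes in this setup
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.storageDir: hdfs:///flink/ha/
    # HA rpc port range opened on the firewall for both jobmanagers
    high-availability.jobmanager.port: 50000-50025
    # workaround from above: set to the local hostname on each node
    jobmanager.rpc.address: jm0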

Re: strange behavior with jobmanager.rpc.address on standalone HA cluster

Posted by Till Rohrmann <tr...@apache.org>.
Alright, try to grab the logs if you see this problem recurring. I would
be interested in understanding why this happens.

Cheers,
Till

On Fri, May 18, 2018 at 9:45 PM, Derek VerLee <de...@gmail.com> wrote:

> Till,
>
> Thanks for the response.  Sorry for the delayed reply.
>
> The Flink version is 1.3.2, in standalone mode.  We'll probably upgrade
> to 1.4, or directly to 1.5 once it is released in the very near future.
> I intend to migrate to running it on our Kubernetes cluster, and I will
> probably run just one jobmanager, as that seems to be the most frequent
> recommendation.
>
> I'm not sure I have logs anymore ... we are very actively working against
> our development environment, and the debug logs were crashing our log
> aggregation service, so I had to stop forwarding them and turn on an
> aggressive log rotation.  We've been crunched under a deadline for our
> first anomaly detection pipeline.
>
> At the time, nothing much jumped out in the logs to help me, except that
> I do remember seeing some messages that seemed to be looking for an "akka
> leader" at whatever host I had put into jobmanager.rpc.address.  I have
> "akka.actor.ActorNotFound" in my search history.
> Sorry I don't have something more useful.
>
>
> On 5/13/18 3:50 PM, Till Rohrmann wrote:
>
> Hi Derek,
>
> given that you've started the different Flink cluster components all with
> the same HA-enabled configuration, the TMs should be able to connect to
> jm1 after you've killed jm0.  The jobmanager.rpc.address setting should
> not be used when HA mode is enabled.
>
> In order to get to the bottom of the described problem, it would be
> tremendously helpful to get access to the logs of all components (jm0, jm1
> and the TMs). Additionally, it would be good to know which Flink version
> you're using.
>
> Cheers,
> Till
>

Re: strange behavior with jobmanager.rpc.address on standalone HA cluster

Posted by Till Rohrmann <tr...@apache.org>.
Hi Derek,

given that you've started the different Flink cluster components all with
the same HA-enabled configuration, the TMs should be able to connect to jm1
after you've killed jm0. The jobmanager.rpc.address setting should not be
used when HA mode is enabled.
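
(For context on why that is the expectation: in ZooKeeper HA mode the
leading jobmanager publishes its actor address, roughly of the form

    akka.tcp://flink@<leader-host>:<rpc-port>/user/jobmanager

to ZooKeeper, and the taskmanagers look up the current leader there rather
than deriving an address from jobmanager.rpc.address. The exact form can
vary between versions; the address above is only a sketch.)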

In order to get to the bottom of the described problem, it would be
tremendously helpful to get access to the logs of all components (jm0, jm1
and the TMs). Additionally, it would be good to know which Flink version
you're using.

Cheers,
Till
