Posted to user@flink.apache.org by Claude M <cl...@gmail.com> on 2020/11/03 01:33:04 UTC

Error while retrieving the leader gateway after making Flink config changes

Hello,

I have Flink 1.10.2 installed in a Kubernetes cluster.
Any time I make a change to flink.conf, the Flink jobmanager pod fails
to restart.
For example, I modified the following memory setting in flink.conf:
jobmanager.memory.flink.size.
After I deploy the change, the pod fails to restart and the following
appears in the log:

WARN  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  -
Error while retrieving the leader gateway. Retrying to connect to
akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.

The pod can be restored by doing one of the following, but none of these
is an acceptable solution:

   - Revert the changes made to the flink.conf to the previous settings
   - Remove the Flink Kubernetes deployment before doing a deployment
   - Delete the flink cluster folder in Zookeeper

I don't understand why making any change to flink.conf causes this
problem.
Any help is appreciated.


Thank You

Re: Error while retrieving the leader gateway after making Flink config changes

Posted by Claude M <cl...@gmail.com>.
This issue had to do with the update strategy of the Flink Kubernetes
deployment. Once I changed it to the following, the pod restarted
successfully:

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
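For context, here is a minimal sketch of where that strategy block sits in a
Deployment manifest. The names, labels, and image tag below are illustrative
assumptions, not taken from the actual setup. With maxSurge: 0 and
maxUnavailable: 1, Kubernetes terminates the old JobManager pod before
starting the new one, so two JobManagers never run at the same time and
compete for the same leader entry in ZooKeeper:

```yaml
# Hypothetical Deployment fragment; names and labels are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # never start the new pod while the old one is alive
      maxUnavailable: 1  # allow the single old pod to be taken down first
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: flink:1.10.2-scala_2.12  # illustrative image tag
          args: ["jobmanager"]
```

The default strategy (maxSurge: 1) briefly runs the old and new pods side by
side during a rollout, which is exactly the situation the setting above avoids.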

On Tue, Nov 3, 2020 at 1:39 PM Robert Metzger <rm...@apache.org> wrote:

> Thanks a lot for providing the logs.
>
> My theory of what is happening is the following:
> 1. You are probably increasing the memory for the JobManager, when
> changing the  jobmanager.memory.flink.size configuration value
> 2. Due to this changed memory configuration, Kubernetes, Docker or the
> Linux kernel are killing your JobManager process because it allocates too
> much memory.
>
> Flink should not stop like this. Fatal errors are logged explicitly, kill
> signals are also logged.
> Can you check Kubernetes, Docker, Linux for any signs that they are
> killing your JobManager?
>
>
>
> On Tue, Nov 3, 2020 at 7:06 PM Claude M <cl...@gmail.com> wrote:
>
>> Thanks for your reply Robert.  Please see attached log from the job
>> manager, the last line is the only thing I see different from a pod that
>> starts up successfully.
>>
>> On Tue, Nov 3, 2020 at 10:41 AM Robert Metzger <rm...@apache.org>
>> wrote:
>>
>>> Hi Claude,
>>>
>>> I agree that you should be able to restart individual pods with a
>>> changed memory configuration. Can you share the full Jobmanager log of the
>>> failed restart attempt?
>>>
>>> I don't think that the log statement you've posted explains a start
>>> failure.
>>>
>>> Regards,
>>> Robert
>>>
>>> On Tue, Nov 3, 2020 at 2:33 AM Claude M <cl...@gmail.com> wrote:
>>>
>>>>
>>>> Hello,
>>>>
>>>> I have Flink 1.10.2 installed in a Kubernetes cluster.
>>>> Anytime I make a change to the flink.conf, the Flink jobmanager pod
>>>> fails to restart.
>>>> For example, I modified the following memory setting in the flink.conf:
>>>> jobmanager.memory.flink.size.
>>>> After I deploy the change, the pod fails to restart and the following
>>>> is seen in the log:
>>>>
>>>> WARN
>>>>  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  -
>>>> Error while retrieving the leader gateway. Retrying to connect to
>>>> akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.
>>>>
>>>> The pod can be restored by doing one of the following but these are not
>>>> acceptable solutions:
>>>>
>>>>    - Revert the changes made to the flink.conf to the previous settings
>>>>    - Remove the Flink Kubernetes deployment before doing a deployment
>>>>    - Delete the flink cluster folder in Zookeeper
>>>>
>>>> I don't understand why making any changes in the flink.conf causes this
>>>> problem.
>>>> Any help is appreciated.
>>>>
>>>>
>>>> Thank You
>>>>
>>>

Re: Error while retrieving the leader gateway after making Flink config changes

Posted by Robert Metzger <rm...@apache.org>.
Thanks a lot for providing the logs.

My theory of what is happening is the following:
1. You are probably increasing the memory for the JobManager when changing
the jobmanager.memory.flink.size configuration value.
2. Due to this changed memory configuration, Kubernetes, Docker, or the
Linux kernel is killing your JobManager process because it allocates too
much memory.

Flink should not stop like this. Fatal errors are logged explicitly, and
kill signals are also logged.
Can you check Kubernetes, Docker, and Linux for any signs that they are
killing your JobManager?
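This theory can be sanity-checked with some rough arithmetic. The sketch
below is a hedged illustration, not Flink's exact accounting: it assumes a
FLIP-49-style model where the total JVM process size is the configured Flink
size plus JVM metaspace plus a clamped JVM-overhead fraction, and that this
total must fit under the pod's memory limit or the container gets OOM-killed.
The default values used (256 MiB metaspace, 10% overhead clamped to
[192 MiB, 1 GiB]) are illustrative assumptions; check your own
flink-conf.yaml and pod spec for the real numbers.

```python
# Rough sanity check: does the configured Flink memory fit the pod limit?
# All values in MiB; defaults below are illustrative, not authoritative.

def process_size_mib(flink_size_mib, metaspace_mib=256, overhead_fraction=0.1,
                     overhead_min_mib=192, overhead_max_mib=1024):
    """Approximate total JVM process size: Flink memory plus metaspace
    plus JVM overhead (a fraction of the total, clamped to [min, max])."""
    base = flink_size_mib + metaspace_mib
    # Overhead is a fraction of the *total* process size, so solve
    # approximately by scaling the base up, then clamp the difference.
    total = base / (1 - overhead_fraction)
    overhead = min(max(total - base, overhead_min_mib), overhead_max_mib)
    return base + overhead

def fits_pod_limit(flink_size_mib, pod_limit_mib):
    return process_size_mib(flink_size_mib) <= pod_limit_mib

# A 1 GiB Flink size comfortably fits a 2 GiB pod limit...
print(fits_pod_limit(1024, 2048))   # True
# ...but raising flink.size close to the limit leaves no room for
# metaspace and overhead, so the kernel OOM-kills the container.
print(fits_pod_limit(1900, 2048))   # False
```

In practice, a kill like this usually shows up as `OOMKilled` in
`kubectl describe pod` output, or in the pod's
`lastState.terminated.reason` field.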

Re: Error while retrieving the leader gateway after making Flink config changes

Posted by Claude M <cl...@gmail.com>.
Thanks for your reply, Robert. Please see the attached log from the job
manager; the last line is the only thing I see that differs from a pod that
starts up successfully.

Re: Error while retrieving the leader gateway after making Flink config changes

Posted by Robert Metzger <rm...@apache.org>.
Hi Claude,

I agree that you should be able to restart individual pods with a changed
memory configuration. Can you share the full JobManager log of the failed
restart attempt?

I don't think that the log statement you've posted explains a start failure.

Regards,
Robert
