Posted to user@flink.apache.org by Xiangyu Su <xi...@smaato.com> on 2021/09/01 07:30:23 UTC

FLINK-14316 happens on version 1.13.2

Hello Everyone,
We upgraded Flink to 1.13.2 and are randomly hitting the "Job leader ...
lost leadership" error; the job keeps restarting and failing.
It behaves like this ticket:
https://issues.apache.org/jira/browse/FLINK-14316

Has anybody had the same issue, or does anyone have suggestions?

Best Regards,


-- 
Xiangyu Su
Java Developer
xiangyu@smaato.com

Smaato Inc.
San Francisco - New York - Hamburg - Singapore
www.smaato.com

Germany:

Barcastraße 5

22087 Hamburg

Germany
M 0049(176)43330282

The information contained in this communication may be CONFIDENTIAL and is
intended only for the use of the recipient(s) named above. If you are not
the intended recipient, you are hereby notified that any dissemination,
distribution, or copying of this communication, or any of its contents, is
strictly prohibited. If you have received this communication in error,
please notify the sender and delete/destroy the original message and any
copy of it from your computer or paper files.

Re: FLINK-14316 happens on version 1.13.2

Posted by Xiangyu Su <xi...@smaato.com>.
Hi Guys,
sorry for the late reply.
We found out that the issue is not related to Flink; there is a connection
issue with ZooKeeper. We deploy our whole infrastructure on k8s using AWS
spot EC2 instances, and once a pod gets restarted or a spot instance is
lost, we lose the log files, so sorry for not being able to share them.

Sharing some of our experiences:
"Job leader lost leadership" issues can have different causes, most
probably ZooKeeper, and as far as we have seen this failure does not by
itself cause a job failure.
The checkpoint timeout issue can also be due to a ZooKeeper issue: the
meta info of the last successful checkpoint (CK) is stored in ZK, so if ZK
has an issue, Flink will not be able to restore from the last checkpoint.
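
For anyone who wants to rule out the same cause, below is a minimal sketch of a TCP reachability probe that could be run from inside a TaskManager or JobManager pod. The class name `ZkProbe` and the method `isReachable` are made up for illustration and are not part of Flink or ZooKeeper; the default host and port are simply the ones that appear in the log excerpt quoted in this thread. Note that a successful connect only shows the port is open at that moment. It does not prove that the ZooKeeper session used for leader election is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration; not part of Flink or ZooKeeper.
final class ZkProbe {
    private ZkProbe() {}

    /** Returns true if a TCP connection to host:port opens within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are taken from the log excerpt in this thread (assumption:
        // your ZooKeeper service is addressed the same way).
        String host = args.length > 0 ? args[0] : "dpl-zookeeper-0.dpl-zookeeper";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        System.out.println(host + ":" + port + " reachable="
                + isReachable(host, port, 2000));
    }
}
```

Running this in a loop while the job is up may help correlate "lost leadership" events with moments where ZooKeeper is unreachable from the pod.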

On Tue, 7 Sept 2021 at 10:00, Matthias Pohl <ma...@ververica.com> wrote:

> Hi Xiangyu,
> thanks for reaching out to the community. Could you share the entire
> TaskManager and JobManager logs with us? That might help us investigate
> what's going on.
>
> Best,
> Matthias
>
> On Fri, Sep 3, 2021 at 10:07 AM Xiangyu Su <xi...@smaato.com> wrote:
>
>> Hi Yun,
>> Thanks a lot.
>> I am running a test and facing the "Job Leader lost leadership..."
>> issue, and also a checkpointing timeout at the same time. I am not sure
>> whether those two things are related to each other.
>> Regarding your questions:
>> 1. GC looks OK.
>> 2. It seems that once the "Job Leader lost leadership..." error happens,
>> the Flink job cannot be restarted successfully.
>> For example, here are some logs from one job failure:
>> ---------------
>> 2021-09-02 20:41:11,345 WARN  org.apache.flink.runtime.taskmanager.Task
>>                  [] - KeyedProcess -> Sink: StatsdMetricsSink (40/48)#18
>> (9ab62cc148569e449fdb31b521ec976c) switched from RUNNING to FAILED with
>> failure cause: org.apache.flink.util.FlinkException: Disconnect from
>> JobManager responsible for ec6fd88643747aafac06ee906e421a96.
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1500(TaskExecutor.java:181)
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:2189)
>> at java.util.Optional.ifPresent(Optional.java:159)
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:2187)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> Caused by: java.lang.Exception: Job leader for job id
>> ec6fd88643747aafac06ee906e421a96 lost leadership.
>> ... 24 more
>>
>> -----------
>> 2021-09-02 20:47:22,388 ERROR
>> org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] -
>> Authentication failed
>> 2021-09-02 20:47:22,388 INFO
>>  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
>> Opening socket connection to server dpl-zookeeper-0.dpl-zookeeper/
>> 10.168.175.10:2181
>> 2021-09-02 20:47:22,388 WARN
>>  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
>> SASL configuration failed: javax.security.auth.login.LoginException: No
>> JAAS configuration section named 'Client' was found in specified JAAS
>> configuration file: '/tmp/jaas-4480663428736118963.conf'. Will continue
>> connection to Zookeeper server without SASL authentication, if Zookeeper
>> server allows it.
>> at
>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> ~[?:1.8.0_302]
>> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
>> at
>> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution
>> attempt deb6e9dd535069eb66e2139fde5b77cd was not found.
>> 2021-09-02 20:47:21,870 INFO
>>  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Cannot
>> find task to fail for execution deb6e9dd535069eb66e2139fde5b77cd with
>> exception:
>> at
>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>>     at
>> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>> [flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>> at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> ~[?:1.8.0_302]
>> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
>> at
>> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441)
>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
>>
>> Thanks for your support.
>> Best Regards,
>>
>> On Thu, 2 Sept 2021 at 16:43, Yun Gao <yu...@aliyun.com> wrote:
>>
>>> Hi Xiangyu,
>>>
>>> There might be different reasons for the "Job Leader... lost leadership"
>>> problem. Do you see the errors in the TM log? If so, the root cause
>>> might be that the connection between the TM and ZK is lost or timed
>>> out. Have you checked the GC status on the TM side? If the GC is OK,
>>> could you provide a more detailed exception stack?
>>>
>>> Best,
>>> Yun
>>>
>>>
>>> ------------------Original Mail ------------------
>>> *Sender:*Xiangyu Su <xi...@smaato.com>
>>> *Send Date:*Wed Sep 1 15:31:03 2021
>>> *Recipients:*user <us...@flink.apache.org>
>>> *Subject:*FLINK-14316 happens on version 1.13.2
>>>
>>>> Hello Everyone,
>>>> We upgrade flink to 1.13.2, and we were facing randomly the "Job leader
>>>> ... lost leadership" error, the job keep restarting and failing...
>>>> It behaviours like this ticket
>>>> https://issues.apache.org/jira/browse/FLINK-14316
>>>>
>>>> Did anybody had same issue or any suggestions?
>>>>
>>>> Best Regards,
>>>
>>
>
>


Re: FLINK-14316 happens on version 1.13.2

Posted by Matthias Pohl <ma...@ververica.com>.
Hi Xiangyu,
thanks for reaching out to the community. Could you share the entire
TaskManager and JobManager logs with us? That might help us investigate
what's going on.

Best,
Matthias


Re: FLINK-14316 happens on version 1.13.2

Posted by Xiangyu Su <xi...@smaato.com>.
Hi Yun,
Thanks a lot.
I am running a test and facing the "Job Leader lost leadership..." issue,
and also a checkpointing timeout at the same time. I am not sure whether
those two things are related to each other.
Regarding your questions:
1. GC looks OK.
2. It seems that once the "Job Leader lost leadership..." error happens,
the Flink job cannot be restarted successfully.
For example, here are some logs from one job failure:
---------------
2021-09-02 20:41:11,345 WARN  org.apache.flink.runtime.taskmanager.Task
               [] - KeyedProcess -> Sink: StatsdMetricsSink (40/48)#18
(9ab62cc148569e449fdb31b521ec976c) switched from RUNNING to FAILED with
failure cause: org.apache.flink.util.FlinkException: Disconnect from
JobManager responsible for ec6fd88643747aafac06ee906e421a96.
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1500(TaskExecutor.java:181)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:2189)
at java.util.Optional.ifPresent(Optional.java:159)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:2187)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Job leader for job id
ec6fd88643747aafac06ee906e421a96 lost leadership.
... 24 more
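
To check whether leadership loss and ZooKeeper client problems cluster around the same times, a log can be scanned for the relevant messages. The following is only a rough sketch: the class `LeadershipLogScan` and its `scan` method are hypothetical, and the marker strings are taken from the excerpts in this message rather than from any documented list of Flink log messages. Continuation lines (stack-trace frames, "Caused by: ...") carry no timestamp, so the sketch attributes each hit to the most recent timestamp seen.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for triaging a TaskManager log; not part of Flink.
final class LeadershipLogScan {
    // Leading "yyyy-MM-dd HH:mm:ss,SSS" timestamp, as in the excerpt above.
    private static final Pattern TIMESTAMP =
            Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})");

    // Substrings copied from the log excerpts in this message (assumption:
    // your logs use the same wording).
    private static final String[] MARKERS = {
        "lost leadership", "Authentication failed", "SASL configuration failed"
    };

    private LeadershipLogScan() {}

    /** Returns "timestamp marker" for every line that contains a marker. */
    static List<String> scan(List<String> lines) {
        List<String> hits = new ArrayList<>();
        String lastTimestamp = "unknown";
        for (String line : lines) {
            Matcher m = TIMESTAMP.matcher(line);
            if (m.find()) {
                lastTimestamp = m.group(1);
            }
            for (String marker : MARKERS) {
                if (line.contains(marker)) {
                    hits.add(lastTimestamp + " " + marker);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: LeadershipLogScan <path-to-log-file>");
            return;
        }
        scan(Files.readAllLines(Paths.get(args[0]))).forEach(System.out::println);
    }
}
```

If the "lost leadership" hits consistently follow ZooKeeper client warnings by a short interval, that would point toward the connection rather than the job itself.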

-----------
2021-09-02 20:47:22,388 ERROR
org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] -
Authentication failed
2021-09-02 20:47:22,388 INFO
 org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Opening socket connection to server dpl-zookeeper-0.dpl-zookeeper/
10.168.175.10:2181
2021-09-02 20:47:22,388 WARN
 org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
SASL configuration failed: javax.security.auth.login.LoginException: No
JAAS configuration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-4480663428736118963.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_302]
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution
attempt deb6e9dd535069eb66e2139fde5b77cd was not found.
2021-09-02 20:47:21,870 INFO
 org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Cannot
find task to fail for execution deb6e9dd535069eb66e2139fde5b77cd with
exception:
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.13.2.jar:1.13.2]
    at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_302]
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441)
~[flink-dist_2.11-1.13.2.jar:1.13.2]

Thanks for your support.
Best Regards,

On Thu, 2 Sept 2021 at 16:43, Yun Gao <yu...@aliyun.com> wrote:

> Hi Xiangyu,
>
> There might be different reasons for the "Job Leader... lost leadership"
> problem. Do you see the errors in the TM log? If so, the root cause might
> be that the connection between the TM and ZK was lost or timed out. Have
> you checked the GC status on the TM side? If the GC is OK, could you
> provide a more detailed exception stack?
>
> Best,
> Yun
>
>
> ------------------Original Mail ------------------
> *Sender:*Xiangyu Su <xi...@smaato.com>
> *Send Date:*Wed Sep 1 15:31:03 2021
> *Recipients:*user <us...@flink.apache.org>
> *Subject:*FLINK-14316 happens on version 1.13.2
>
>> Hello Everyone,
>> We upgraded Flink to 1.13.2 and are randomly hitting the "Job leader
>> ... lost leadership" error; the job keeps restarting and failing.
>> It behaves like this ticket:
>> https://issues.apache.org/jira/browse/FLINK-14316
>>
>> Has anybody had the same issue, or any suggestions?
>>
>> Best Regards,
>>
>>
>

-- 
Xiangyu Su
Java Developer
xiangyu@smaato.com


Re: FLINK-14316 happens on version 1.13.2

Posted by Yun Gao <yu...@aliyun.com>.
Hi Xiangyu,

There might be different reasons for the "Job Leader... lost leadership" problem. Do you see the errors
in the TM log? If so, the root cause might be that the connection between the TM and ZK was lost or
timed out. Have you checked the GC status on the TM side? If the GC is OK, could you provide a more
detailed exception stack?
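
One way to follow up on the questions above is to scan the TaskManager log for ZooKeeper-related events around the time leadership was lost. The following is only a minimal sketch in Python. The labels and the function names scan_log and summarize are illustrative and not part of Flink. The match strings are taken from the log excerpt earlier in this thread, except the Curator "State change" pattern, which is an assumption about what the TM log contains.

```python
import re
from collections import Counter

# Match strings copied from the log excerpt in this thread.
# The "State change" pattern is an ASSUMPTION about Curator's log output.
PATTERNS = {
    "lost_leadership": re.compile(r"lost leadership"),
    "zk_auth_failed": re.compile(r"Authentication failed"),
    "zk_sasl_failed": re.compile(r"SASL configuration failed"),
    "zk_reconnect": re.compile(r"Opening socket connection to server"),
    "zk_state_change": re.compile(r"State change: (SUSPENDED|LOST)"),
}

# Log records start with a timestamp such as "2021-09-02 20:47:22,388".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")


def scan_log(lines):
    """Return (timestamp, label) pairs for every line matching a pattern."""
    events = []
    last_ts = None  # stack-trace lines have no timestamp; reuse the last one seen
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            last_ts = match.group(1)
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((last_ts, label))
    return events


def summarize(events):
    """Count how often each label occurred."""
    return Counter(label for _, label in events)
```

To use it, open the TM log file, pass the file object to scan_log, and print the result of summarize. If ZooKeeper reconnects or authentication failures cluster just before the "lost leadership" entries, that points towards the TM-to-ZK connection rather than GC.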

Best,
Yun



 ------------------Original Mail ------------------
Sender:Xiangyu Su <xi...@smaato.com>
Send Date:Wed Sep 1 15:31:03 2021
Recipients:user <us...@flink.apache.org>
Subject:FLINK-14316 happens on version 1.13.2

Hello Everyone,
We upgraded Flink to 1.13.2 and are randomly hitting the "Job leader ... lost leadership" error; the job keeps restarting and failing.
It behaves like this ticket: https://issues.apache.org/jira/browse/FLINK-14316

Has anybody had the same issue, or any suggestions?

Best Regards,

