Posted to user@flink.apache.org by Chen Liangde <li...@gmail.com> on 2020/10/29 21:30:22 UTC

Native kubernetes setup failed to start job

I created a Flink cluster on Kubernetes following this guide:
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html

The JobManager was running. When a job was submitted to it, it spawned a
TaskManager pod, but the TaskManager failed to connect to the JobManager,
and I can't find the TaskManager in the JobManager web UI.

This error is suspicious:
org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded

2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Connecting to ResourceManager akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Could not resolve ResourceManager address akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
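
One way to read the TooLongFrameException above is to print the rejected length as raw bytes. The sketch below assumes the reported number comes from a 4-byte big-endian length prefix read off the wire, and that the "adjusted" value may include the 4 bytes of the length field itself; both are assumptions about the framing, not something the log states. The helper name frame_len_hex is made up for illustration.

```shell
# Hypothetical helper: print a frame length as 8 hex digits (4 bytes, big-endian).
frame_len_hex() {
  printf '%08x\n' "$1"
}

frame_len_hex 352518404            # length as reported in the log   -> 15030104
frame_len_hex $((352518404 - 4))   # assumed raw prefix (minus field) -> 15030100
frame_len_hex 10485760             # the limit from the log (10 MiB)  -> 00a00000
```

Either way the value starts with the bytes 15 03 01, which is an implausible length for an RPC frame. That byte pattern resembles the start of a TLS alert record, so it may be worth checking whether something other than the plain-text JobManager RPC endpoint (for example a proxy or sidecar) is answering on port 6123. This is a guess from the byte pattern only; the logs do not confirm it.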

Re: Native kubernetes setup failed to start job

Posted by Yang Wang <da...@gmail.com>.
Sorry, I overlooked the logs for detection-engine-dev-taskmanager-1-1.

Could you start a busybox pod to check connectivity to the K8s service
"detection-engine-dev"?
It seems that the TaskManager tries to connect and gets a "Connection
reset by peer" response.
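
A minimal sketch of such a check, assuming the service is reachable as "detection-engine-dev" on port 6123 inside the "team-anti-cheat" namespace (both taken from the logs in this thread). The helper name probe_service_port and the pod name flink-net-probe are hypothetical, and the flags accepted by nc differ between BusyBox builds, so treat the inner command as an assumption.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: start a throwaway busybox pod in the namespace and
# attempt a TCP connection to SERVICE:PORT, printing nc's exit status.
probe_service_port() {
  ns="$1"; svc="$2"; port="${3:-6123}"
  "$KUBECTL" run flink-net-probe --image=busybox --restart=Never --rm -i \
    -n "$ns" -- sh -c "nc -w 3 $svc $port </dev/null; echo nc exit status: \$?"
}

# Example:
#   probe_service_port team-anti-cheat detection-engine-dev 6123
```

A plain TCP connect may succeed even when the RPC handshake later fails, so a successful probe narrows the problem down rather than ruling the network out.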

Best,
Yang

On Mon, Nov 2, 2020 at 5:41 PM, Yang Wang <da...@gmail.com> wrote:

> Hi Liangde Chen,
>
> Thanks for providing the logs. After checking them, I am afraid that
> there is something wrong with your K8s cluster, since
> detection-engine-dev-taskmanager-1-2 started and registered with the
> JobManager successfully.
>
> I suggest finding which K8s node detection-engine-dev-taskmanager-1-1 is
> running on and disabling scheduling on it. Then restart the Flink K8s
> session and try again.
>
> Best,
> Yang

Re: Native kubernetes setup failed to start job

Posted by Yang Wang <da...@gmail.com>.
Hi Liangde Chen,

Thanks for providing the logs. After checking them, I am afraid that
there is something wrong with your K8s cluster, since
detection-engine-dev-taskmanager-1-2 started and registered with the
JobManager successfully.

I suggest finding which K8s node detection-engine-dev-taskmanager-1-1 is
running on and disabling scheduling on it. Then restart the Flink K8s
session and try again.
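
The two steps can be sketched with kubectl as follows. The pod and namespace names come from this thread; the helper names pod_node and cordon_node are hypothetical, and cordoning usually needs cluster-level permissions that a namespaced service account may not have.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: print the name of the node a pod is scheduled on.
pod_node() {
  ns="$1"; pod="$2"
  "$KUBECTL" get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}'
}

# Hypothetical helper: mark a node unschedulable so new pods avoid it.
cordon_node() {
  "$KUBECTL" cordon "$1"
}

# Example:
#   node="$(pod_node team-anti-cheat detection-engine-dev-taskmanager-1-1)"
#   cordon_node "$node"
```

Running kubectl uncordon on the same node re-enables scheduling once the test is done.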

Best,
Yang

On Mon, Nov 2, 2020 at 3:55 PM, Chen Liangde <li...@gmail.com> wrote:

> Please find attached logs.
>
> The Kubernetes cluster is an AWS EKS cluster, managed by our infra team.
> I created a service account "flink" for it, with permission to create,
> list, and delete pods, along with some other resource types, in the
> "team-anti-cheat" namespace.
>
> The following command was used to create the Flink cluster:
> ./bin/kubernetes-session.sh \
>         -Dexecution.attached=true \
>         -Dkubernetes.cluster-id=detection-engine-dev \
>         -Dkubernetes.namespace=team-anti-cheat \
>         -Dkubernetes.container-start-command-template="%java% %classpath% %jvmmem% %jvmopts% %logging% %class% %args%" \
>         -Dkubernetes.jobmanager.service-account=flink
>
> Thanks
> Liangde Chen

Re: Native kubernetes setup failed to start job

Posted by Chen Liangde <li...@gmail.com>.
Please find attached logs.

The Kubernetes cluster is an AWS EKS cluster, managed by our infra team.
I created a service account "flink" for it, with permission to create,
list, and delete pods, along with some other resource types, in the
"team-anti-cheat" namespace.

The following command was used to create the Flink cluster:
./bin/kubernetes-session.sh \
        -Dexecution.attached=true \
        -Dkubernetes.cluster-id=detection-engine-dev \
        -Dkubernetes.namespace=team-anti-cheat \
        -Dkubernetes.container-start-command-template="%java% %classpath% %jvmmem% %jvmopts% %logging% %class% %args%" \
        -Dkubernetes.jobmanager.service-account=flink

Thanks
Liangde Chen


On Mon, 2 Nov 2020 at 08:20, Yang Wang <da...@gmail.com> wrote:

> Could you share the JobManager logs so that we can check whether it
> received the registration from the TaskManager?
>
> In a non-HA Flink cluster, the TaskManager uses the service to talk to
> the JobManager. Currently, Flink creates a headless service for the
> JobManager. You can use `kubectl get svc` to find it, and then start a
> busybox pod to check the network connectivity.
>
> Could you also share more information about the environment? I could
> not reproduce your issue in a typical K8s cluster.
>
> Best,
> Yang

Re: Native kubernetes setup failed to start job

Posted by Yang Wang <da...@gmail.com>.
Could you share the JobManager logs so that we can check whether it
received the registration from the TaskManager?

In a non-HA Flink cluster, the TaskManager uses the service to talk to
the JobManager. Currently, Flink creates a headless service for the
JobManager. You can use `kubectl get svc` to find it, and then start a
busybox pod to check the network connectivity.
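
As a rough sketch of the lookup step, assuming the JobManager service is named after the cluster id ("detection-engine-dev", matching the address in the TaskManager log) and lives in the "team-anti-cheat" namespace. The helper name show_jobmanager_service is hypothetical.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo for a dry run); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: list the services in a namespace, then show which
# pod IPs and ports back the given service.
show_jobmanager_service() {
  ns="$1"; svc="$2"
  "$KUBECTL" get svc -n "$ns"
  "$KUBECTL" get endpoints "$svc" -n "$ns" -o wide
}

# Example:
#   show_jobmanager_service team-anti-cheat detection-engine-dev
```

If the endpoints list is empty or points at an unexpected pod IP, the TaskManager would be connecting to something other than the JobManager.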

Could you also share more information about the environment? I could
not reproduce your issue in a typical K8s cluster.

Best,
Yang

On Fri, Oct 30, 2020 at 11:53 AM, Yun Gao <yu...@aliyun.com> wrote:

> Hi Liangde,
>
>    I'm pulling in Yang Wang, who is the expert on Flink on K8s.
>
> Best,
>  Yun

Re: Native kubernetes setup failed to start job

Posted by Yun Gao <yu...@aliyun.com>.
Hi Liangde,

   I'm pulling in Yang Wang, who is the expert on Flink on K8s.

Best,
 Yun

 ------------------Original Mail ------------------
Sender:Chen Liangde <li...@gmail.com>
Send Date:Fri Oct 30 05:30:40 2020
Recipients:Flink ML <us...@flink.apache.org>
Subject:Native kubernetes setup failed to start job

I created a Flink cluster on Kubernetes following this guide:
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html

The JobManager was running. When a job was submitted to it, it spawned a
TaskManager pod, but the TaskManager failed to connect to the JobManager,
and I can't find the TaskManager in the JobManager web UI.

This error is suspicious:
org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded

2020-10-29 13:22:51,069 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Connecting to ResourceManager akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000).
2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer
2020-10-29 13:22:51,176 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 352518404 - discarded
2020-10-29 13:22:51,180 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123]] Caused by: [The remote system explicitly disassociated (reason unknown).]
2020-10-29 13:22:51,183 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Could not resolve ResourceManager address akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@detection-engine-dev.team-anti-cheat:6123/user/rpc/resourcemanager_*.
2020-10-29 13:23:01,203 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [detection-engine-dev.team-anti-cheat/10.123.155.112:6123] failed with java.io.IOException: Connection reset by peer