Posted to user@flink.apache.org by mejri houssem <me...@gmail.com> on 2021/09/03 16:15:31 UTC

Job manager crash

Hello,

I have been facing JobManager (JM) crashes lately. I am deploying a
Flink application cluster on Kubernetes.

When I install my chart using Helm, everything works fine, but after
some time the JM starts to crash, and it eventually gets deleted after
5 restarts.

Flink version: 1.12.5 (recently upgraded from 1.12.2)
HA mode: Kubernetes

The full log of the JM is in the attached file.

Re: Job manager crash

Posted by mejri houssem <me...@gmail.com>.
Thanks for the response.

Regarding the api-server, I don't think I can do much about it, since I
am only using a specific namespace in the Kubernetes cluster; I am not
the one who administers the cluster.

Otherwise, I will try the GC log option to see if I can find something
useful to debug this problem.



On Thu, Sep 9, 2021 at 10:25 PM, houssem <me...@gmail.com> wrote:


Re: Job manager crash

Posted by houssem <me...@gmail.com>.
Hello,

with respect to the api-server i dotn re

On 2021/09/09 11:37:49, Yang Wang <da...@gmail.com> wrote: 

Re: Job manager crash

Posted by Yang Wang <da...@gmail.com>.
The GC log looks quite normal. Maybe the K8s APIServer is overloaded.

Best,
Yang

houssem <me...@gmail.com> wrote on Mon, Sep 13, 2021 at 5:11 PM:


Re: Job manager crash

Posted by houssem <me...@gmail.com>.
Hello,

Here is some of the full GC log:

OpenJDK 64-Bit Server VM (25.232-b09) for linux-amd64 JRE (1.8.0_232-b09), built on Oct 18 2019 15:04:46 by "jenkins" with gcc 4.8.2 20140120 (Red Hat 4.8.2-15)
Memory: 4k page, physical 976560k(946672k free), swap 0k(0k free)
CommandLine flags: -XX:CompressedClassSpaceSize=260046848 -XX:InitialHeapSize=1073741824 -XX:MaxHeapSize=1073741824 -XX:MaxMetaspaceSize=268435456 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops 
2021-09-13T09:28:11.569+0200: 3.516: [Full GC (Metadata GC Threshold) 2021-09-13T09:28:11.569+0200: 3.516: [Tenured: 0K->12699K(699072K), 0.0986073 secs] 67116K->12699K(1013632K), [Metaspace: 20705K->20705K(1067008K)], 0.0987201 secs] [Times: user=0.03 sys=0.02, real=0.10 secs] 
2021-09-13T09:28:15.560+0200: 7.507: [Full GC (Metadata GC Threshold) 2021-09-13T09:28:15.560+0200: 7.507: [Tenured: 12699K->24229K(699072K), 0.2937536 secs] 105133K->24229K(1013632K), [Metaspace: 33805K->33805K(1079296K)], 0.2938554 secs] [Times: user=0.13 sys=0.00, real=0.29 secs] 
2021-09-13T09:28:22.744+0200: 14.691: [Full GC (Metadata GC Threshold) 2021-09-13T09:28:22.744+0200: 14.691: [Tenured: 24229K->50182K(699072K), 0.2362689 secs] 187184K->50182K(1013632K), [Metaspace: 56762K->56762K(1099776K)], 0.2363739 secs] [Times: user=0.11 sys=0.02, real=0.24 secs] 
2021-09-13T09:31:50.257+0200: 222.204: [GC (Allocation Failure) 2021-09-13T09:31:50.257+0200: 222.204: [DefNew: 279616K->20089K(314560K), 0.1042210 secs] 329798K->70271K(1013632K), 0.1043736 secs] [Times: user=0.04 sys=0.03, real=0.10 secs] 
2021-09-13T09:40:32.456+0200: 744.403: [GC (Allocation Failure) 2021-09-13T09:40:32.456+0200: 744.403: [DefNew: 299705K->435K(314560K), 0.0255928 secs] 349887K->56275K(1013632K), 0.0257074 secs] [Times: user=0.02 sys=0.01, real=0.03 secs] 
2021-09-13T09:50:41.809+0200: 1353.756: [GC (Allocation Failure) 2021-09-13T09:50:41.809+0200: 1353.756: [DefNew: 280051K->551K(314560K), 0.0089400 secs] 335891K->56391K(1013632K), 0.0090356 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2021-09-13T10:01:33.109+0200: 2005.056: [GC (Allocation Failure) 2021-09-13T10:01:33.109+0200: 2005.056: [DefNew: 280167K->707K(314560K), 0.0099544 secs] 336007K->56547K(1013632K), 0.0100724 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
2021-09-13T10:11:53.384+0200: 2625.331: [GC (Allocation Failure) 2021-09-13T10:11:53.384+0200: 2625.331: [DefNew: 280323K->857K(314560K), 0.0095649 secs] 336163K->56697K(1013632K), 0.0096763 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2021-09-13T10:21:31.798+0200: 3203.745: [GC (Allocation Failure) 2021-09-13T10:21:31.798+0200: 3203.745: [DefNew: 280473K->945K(314560K), 0.0085233 secs] 336313K->56785K(1013632K), 0.0086403 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2021-09-13T10:31:44.561+0200: 3816.508: [GC (Allocation Failure) 2021-09-13T10:31:44.561+0200: 3816.508: [DefNew: 280561K->1053K(314560K), 0.0103383 secs] 336401K->56893K(1013632K), 0.0104447 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2021-09-13T10:41:51.289+0200: 4423.236: [GC (Allocation Failure) 2021-09-13T10:41:51.289+0200: 4423.236: [DefNew: 280669K->1009K(314560K), 0.0100803 secs] 336509K->56849K(1013632K), 0.0101961 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2021-09-13T10:52:13.378+0200: 5045.325: [GC (Allocation Failure) 2021-09-13T10:52:13.378+0200: 5045.325: [DefNew: 280625K->1266K(314560K), 0.0091235 secs] 336465K->57106K(1013632K), 0.0092590 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 
2021-09-13T11:02:20.253+0200: 5652.200: [GC (Allocation Failure) 2021-09-13T11:02:20.253+0200: 5652.200: [DefNew: 280882K->1323K(314560K), 0.0097592 secs] 336722K->57163K(1013632K), 0.0098574 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 

************************************************************

And here is my flink-conf.yaml file:
taskmanager.numberOfTaskSlots: 2
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
queryable-state.proxy.ports: 6125
jobmanager.memory.process.size: 1600m
taskmanager.memory.process.size: 1728m
parallelism.default: 2

#HA K8S
kubernetes.cluster-id: myJob
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://flink-data-integ/data/flink-ha/myJob
kubernetes.namespace: flink-pushavoo-flink-rec
high-availability.kubernetes.leader-election.lease-duration: 60 s
high-availability.kubernetes.leader-election.renew-deadline: 60 s

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10

#Checkpoints
state.backend: filesystem
state.checkpoints.dir: s3://flink-data/data/checkpoints/myJob
state.checkpoints.num-retained: 10

#flink-prometheus 
metrics.reporters: prometheus
metrics.reporter.prometheus.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prometheus.port: 9249

#logback
classloader.parent-first-patterns.additional: net.logstash.logback

#S3
s3.endpoint: *******
s3.access-key: ********
s3.secret-key: ******
env.java.opts.jobmanager: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/flink/log/jobmanager-gc.log



Re: Job manager crash

Posted by Yang Wang <da...@gmail.com>.
I think @Robert Metzger <rm...@apache.org> is right. You need to check
whether your Kubernetes APIServer is working properly (e.g., whether it
is overloaded).
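One rough way to check this from a pod or an admin host (a sketch; it
assumes kubectl and suitable cluster credentials are available where it
runs) is to hit a raw health endpoint and watch how long it takes:

```shell
# Probe the Kubernetes API server health endpoint.
# A slow or failing response suggests an overloaded APIServer; run it
# around the times the lease renewals fail and compare.
probe_apiserver() {
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found" >&2
    return 1
  fi
  # /readyz answers "ok" when the API server can serve requests.
  kubectl get --raw /readyz
}

# Example: while true; do date; probe_apiserver; sleep 10; done
```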

Another hint is about full GC. Please use the following config option
to enable GC logging and check the full GC time.
env.java.opts.jobmanager: -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -Xloggc:/opt/flink/log/jobmanager-gc.log
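Once the log is enabled, a quick way to total the reported pauses (a
rough sketch, assuming the classic -XX:+PrintGCDetails record format
where each record ends with "[Times: user=... sys=..., real=X.XX
secs]"):

```shell
# Sum the GC pause times reported in the jobmanager GC log, split into
# Full GC vs minor GC totals.
gc_pause_summary() {
  awk '
    match($0, /real=[0-9]+\.[0-9]+ secs/) {
      # Strip the leading "real=" (5 chars) and trailing " secs" (5 chars).
      t = substr($0, RSTART + 5, RLENGTH - 10)
      if ($0 ~ /Full GC/) full += t
      else minor += t
    }
    END { printf "full=%.2fs minor=%.2fs\n", full, minor }
  ' "$1"
}

# Example: gc_pause_summary /opt/flink/log/jobmanager-gc.log
```

A Full GC total dominated by long pauses would explain missed lease
renewals; small totals point back at the APIServer instead.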

Simply increasing the renew-deadline might help, but it may not solve
the problem completely.
high-availability.kubernetes.leader-election.lease-duration: 120 s
high-availability.kubernetes.leader-election.renew-deadline: 120 s


Best,
Yang

Robert Metzger <rm...@apache.org> wrote on Thu, Sep 9, 2021 at 6:52 PM:


Re: Job manager crash

Posted by Robert Metzger <rm...@apache.org>.
Is the Kubernetes API server you are using particularly busy? Maybe
these issues occur because the server is overloaded?

"Triggering checkpoint 2193 (type=CHECKPOINT) @ 1630681482667 for job
00000000000000000000000000000000."
"Completed checkpoint 2193 for job 00000000000000000000000000000000 (474
bytes in 195 ms)."
"Triggering checkpoint 2194 (type=CHECKPOINT) @ 1630681492667 for job
00000000000000000000000000000000."
"Completed checkpoint 2194 for job 00000000000000000000000000000000 (474
bytes in 161 ms)."
"Renew deadline reached after 60 seconds while renewing lock ConfigMapLock:
myNs - myJob-dispatcher-leader (1bcda6b0-8a5a-4969-b9e4-2257c4478572)"
"Stopping SessionDispatcherLeaderProcess."

At some point, the leader election mechanism in fabric8 seems to give up.


On Tue, Sep 7, 2021 at 10:05 AM mejri houssem <me...@gmail.com>
wrote:


Re: Job manager crash

Posted by mejri houssem <me...@gmail.com>.
Hello,

Here are some other logs from the latest JM crash.


On Mon, Sep 6, 2021 at 2:18 PM, houssem <me...@gmail.com> wrote:


Re: Job manager crash

Posted by houssem <me...@gmail.com>.
Hello,

I have three jobs running on my Kubernetes cluster, and each job has
its own cluster id.

On 2021/09/06 03:28:10, Yangze Guo <ka...@gmail.com> wrote: 

Re: Job manager crash

Posted by Yangze Guo <ka...@gmail.com>.
Hi,

The root cause is not the "java.lang.NoClassDefFoundError". The job had
been running but could not edit the ConfigMap
"myJob-00000000000000000000000000000000-jobmanager-leader", and it
seems to have finally been disconnected from the API server. Is there
another job with the same cluster id (myJob)?

I would also pull in Yang Wang.

Best,
Yangze Guo

On Mon, Sep 6, 2021 at 10:10 AM Caizhi Weng <ts...@gmail.com> wrote:

Re: Job manager crash

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

There is a message saying "java.lang.NoClassDefFoundError:
org/apache/hadoop/hdfs/HdfsConfiguration" in your log file. Are you
accessing HDFS in your job? If so, it seems that your Flink
distribution or your cluster is missing the Hadoop classes. Please make
sure that there are Hadoop jars in the lib directory of Flink, or that
your cluster has set the HADOOP_CLASSPATH environment variable.
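For the second option, the usual recipe is to export HADOOP_CLASSPATH
from a local Hadoop installation before starting the Flink processes (a
sketch; it assumes the `hadoop` CLI is on the PATH of the hosts or
images running Flink):

```shell
# Expose the Hadoop jars to Flink via HADOOP_CLASSPATH.
# Assumes a Hadoop installation with the `hadoop` CLI on PATH; falls
# back to an empty value so the script stays safe where Hadoop is absent.
if command -v hadoop >/dev/null 2>&1; then
  export HADOOP_CLASSPATH="$(hadoop classpath)"
else
  export HADOOP_CLASSPATH=""
fi
echo "HADOOP_CLASSPATH=${HADOOP_CLASSPATH}"
```

In a container image this export belongs in the entrypoint (or the
Dockerfile) so that both the JM and TM processes pick it up.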

mejri houssem <me...@gmail.com> wrote on Sat, Sep 4, 2021 at 12:15 AM:
