Posted to issues@flink.apache.org by "chenyuzhi (Jira)" <ji...@apache.org> on 2024/04/22 08:19:00 UTC
[jira] [Comment Edited] (FLINK-35192) operator oom
[ https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839545#comment-17839545 ]
chenyuzhi edited comment on FLINK-35192 at 4/22/24 8:18 AM:
------------------------------------------------------------
The YAML spec of the operator Deployment:
{code:yaml}
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: flink-kubernetes-operator
    meta.helm.sh/release-namespace: streamfly
  creationTimestamp: "2024-03-13T02:55:09Z"
  generation: 3
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flink-kubernetes-operator
    app.kubernetes.io/version: 1.6.1-GDC1.0.2
    helm.sh/chart: flink-kubernetes-operator-1.6.1-GDC1.0.2
  name: flink-kubernetes-operator
  namespace: streamfly
  resourceVersion: "8064936654"
  uid: 00418b62-820f-4e4a-a138-1ff81f605787
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: flink-kubernetes-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: flink-kubernetes-operator
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: flink-kubernetes-operator
    spec:
      containers:
      - command:
        - /docker-entrypoint.sh
        - operator
        env:
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: flink-kubernetes-operator
        - name: FLINK_CONF_DIR
          value: /opt/flink/conf
        - name: FLINK_PLUGINS_DIR
          value: /opt/flink/plugins
        - name: LOG_CONFIG
          value: -Dlog4j.configurationFile=/opt/flink/conf/log4j-operator.properties
        - name: JVM_ARGS
          value: -Xmx32g -Xms32g -XX:+UseG1GC
        - name: TZ
          value: Asia/Shanghai
        image: ncr.nie.netease.com/v1-gdcstreaming/gdc-flink-kubernetes-operator:1.6.1-GDC1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: health-port
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: flink-kubernetes-operator
        ports:
        - containerPort: 9999
          name: metrics
          protocol: TCP
        - containerPort: 8085
          name: health-port
          protocol: TCP
        resources:
          limits:
            cpu: "10"
            memory: 35Gi
          requests:
            cpu: "10"
            memory: 35Gi
        securityContext: {}
        startupProbe:
          failureThreshold: 30
          httpGet:
            path: /
            port: health-port
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/flink/conf
          name: flink-operator-config-volume
        - mountPath: /opt/scheduler/keytab
          name: flink-operator-keytab-volume
        - mountPath: /flink-data
          name: flink-operator-logs-volume
      - command:
        - /docker-entrypoint.sh
        - webhook
        env:
        - name: WEBHOOK_KEYSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: flink-operator-webhook-secret
        - name: WEBHOOK_KEYSTORE_FILE
          value: /certs/keystore.p12
        - name: WEBHOOK_KEYSTORE_TYPE
          value: pkcs12
        - name: WEBHOOK_SERVER_PORT
          value: "9443"
        - name: LOG_CONFIG
          value: -Dlog4j.configurationFile=/opt/flink/conf/log4j-operator.properties
        - name: JVM_ARGS
        - name: FLINK_CONF_DIR
          value: /opt/flink/conf
        - name: FLINK_PLUGINS_DIR
          value: /opt/flink/plugins
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: ncr.nie.netease.com/v1-gdcstreaming/gdc-flink-kubernetes-operator:1.6.1-GDC1.0.2
        imagePullPolicy: IfNotPresent
        name: flink-webhook
        resources: {}
        securityContext: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /certs
          name: keystore
          readOnly: true
        - mountPath: /opt/flink/conf
          name: flink-operator-config-volume
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: ncr-pull-secret
      nodeSelector:
        node-role.kubernetes.io/edge: ""
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsGroup: 0
        runAsUser: 0
      serviceAccount: flink-operator
      serviceAccountName: flink-operator
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-operator.properties
            path: log4j-operator.properties
          - key: log4j-console.properties
            path: log4j-console.properties
          name: flink-operator-config
        name: flink-operator-config-volume
      - hostPath:
          path: /cfs/flink/keytab
          type: Directory
        name: flink-operator-keytab-volume
      - hostPath:
          path: /home/k8s/logs
          type: DirectoryOrCreate
        name: flink-operator-logs-volume
      - name: keystore
        secret:
          defaultMode: 420
          items:
          - key: keystore.p12
            path: keystore.p12
          secretName: webhook-server-cert
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2024-03-13T02:55:09Z"
    lastUpdateTime: "2024-03-19T06:48:09Z"
    message: ReplicaSet "flink-kubernetes-operator-7756945f7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-04-19T04:08:21Z"
    lastUpdateTime: "2024-04-19T04:08:21Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2
{code}
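For context on the numbers in this spec: the operator container has a 35Gi memory limit and runs with -Xmx32g, which leaves roughly 3 GiB for everything outside the Java heap (metaspace, thread stacks, direct buffers, native allocations). The following is only a sketch of how that headroom and the JVM-visible non-heap areas could be inspected. The class and method names (MemoryBudgetSketch, nonHeapHeadroomBytes, snapshot) are made up for illustration and are not part of the operator.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not operator code: compares the container limit with the
// configured heap and lists the memory areas the JVM itself can report.
public class MemoryBudgetSketch {
    static final long GIB = 1024L * 1024L * 1024L;

    /** Bytes left in the container once the maximum heap is fully used. */
    static long nonHeapHeadroomBytes(long containerLimitBytes, long maxHeapBytes) {
        return containerLimitBytes - maxHeapBytes;
    }

    /** Snapshot of JVM-reported memory usage, in bytes. */
    static Map<String, Long> snapshot() {
        Map<String, Long> out = new LinkedHashMap<>();
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        out.put("heap.used", mem.getHeapMemoryUsage().getUsed());
        out.put("nonheap.used", mem.getNonHeapMemoryUsage().getUsed());
        // Direct and mapped buffer pools live outside the heap but count toward RSS.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            out.put("buffer." + pool.getName() + ".used", pool.getMemoryUsed());
        }
        return out;
    }

    public static void main(String[] args) {
        // Values taken from the spec above: 35Gi limit, -Xmx32g.
        long headroom = nonHeapHeadroomBytes(35 * GIB, 32 * GIB);
        System.out.println("non-heap headroom (GiB): " + headroom / GIB);
        snapshot().forEach((name, bytes) -> System.out.println(name + " = " + bytes));
    }
}
```

Memory allocated by native code (for example through JNI or malloc in libraries) is not visible to these beans. Starting the JVM with -XX:NativeMemoryTracking=summary and running jcmd <pid> VM.native_memory summary is one way to look at the JVM's own native usage.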
I also tried analyzing the heap memory using MAT; here are the results of the analysis:
!screenshot-2.png!
It seems to point to a bug in the JDK [deleteOnExit API|https://bugs.openjdk.org/browse/JDK-4513817]. However, according to that bug report, the leak would lead to insufficient heap memory, whereas the JVM memory metrics show that heap memory was still sufficient when the process exited abnormally. It's strange.
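To illustrate the pattern behind that JDK issue: File.deleteOnExit records each path in a JVM-wide list that is only processed at shutdown, so a long-running process that calls it repeatedly keeps accumulating entries. Below is a minimal sketch of the usual alternative, deleting the temporary file as soon as the work is done. It is not taken from the operator code, and whether the operator or one of its dependencies actually calls deleteOnExit on a hot path is not established here. TempFileCleanupSketch, TempFileAction and withTempFile are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not operator code: scopes a temp file to a block of work
// instead of registering it with File.deleteOnExit().
public class TempFileCleanupSketch {

    /** Work to run against a temporary file. */
    interface TempFileAction<T> {
        T apply(Path file) throws IOException;
    }

    /** Creates a temp file, runs the action, and always deletes the file afterwards. */
    static <T> T withTempFile(String prefix, TempFileAction<T> action) throws IOException {
        Path tmp = Files.createTempFile(prefix, ".tmp");
        try {
            return action.apply(tmp);
        } finally {
            // Explicit cleanup: nothing is left registered for JVM shutdown.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path[] seen = new Path[1];
        long size = withTempFile("demo", file -> {
            seen[0] = file;
            Files.writeString(file, "hello");
            return Files.size(file);
        });
        System.out.println("bytes written: " + size);
        System.out.println("file still exists: " + Files.exists(seen[0]));
    }
}
```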
> operator oom
> ------------
>
> Key: FLINK-35192
> URL: https://issues.apache.org/jira/browse/FLINK-35192
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Environment: jdk: openjdk11
> operator version: 1.6.1
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-04-22-15-47-49-455.png, image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png
>
>
> The Kubernetes operator Docker process was killed by the kernel due to out of memory (the time is 2024.04.03 18:16):
> !image-2024-04-22-15-47-49-455.png!
> Metrics:
> The pod memory (RSS) has been increasing slowly over the past 7 days:
> !screenshot-1.png!
> However, the JVM memory metrics of the operator show no obvious anomaly:
> !image-2024-04-22-15-58-23-269.png!
> !image-2024-04-22-15-58-42-850.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)