Posted to issues@flink.apache.org by "chenyuzhi (Jira)" <ji...@apache.org> on 2024/04/22 08:19:00 UTC

[jira] [Comment Edited] (FLINK-35192) operator oom

    [ https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839545#comment-17839545 ] 

chenyuzhi edited comment on FLINK-35192 at 4/22/24 8:18 AM:
------------------------------------------------------------

The YAML spec of the operator Deployment:
{code:yaml}
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: flink-kubernetes-operator
    meta.helm.sh/release-namespace: streamfly
  creationTimestamp: "2024-03-13T02:55:09Z"
  generation: 3
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flink-kubernetes-operator
    app.kubernetes.io/version: 1.6.1-GDC1.0.2
    helm.sh/chart: flink-kubernetes-operator-1.6.1-GDC1.0.2
  name: flink-kubernetes-operator
  namespace: streamfly
  resourceVersion: "8064936654"
  uid: 00418b62-820f-4e4a-a138-1ff81f605787
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: flink-kubernetes-operator
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: flink-kubernetes-operator
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: flink-kubernetes-operator
    spec:
      containers:
      - command:
        - /docker-entrypoint.sh
        - operator
        env:
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: flink-kubernetes-operator
        - name: FLINK_CONF_DIR
          value: /opt/flink/conf
        - name: FLINK_PLUGINS_DIR
          value: /opt/flink/plugins
        - name: LOG_CONFIG
          value: -Dlog4j.configurationFile=/opt/flink/conf/log4j-operator.properties
        - name: JVM_ARGS
          value: -Xmx32g -Xms32g -XX:+UseG1GC
        - name: TZ
          value: Asia/Shanghai
        image: ncr.nie.netease.com/v1-gdcstreaming/gdc-flink-kubernetes-operator:1.6.1-GDC1.0.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: health-port
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: flink-kubernetes-operator
        ports:
        - containerPort: 9999
          name: metrics
          protocol: TCP
        - containerPort: 8085
          name: health-port
          protocol: TCP
        resources:
          limits:
            cpu: "10"
            memory: 35Gi
          requests:
            cpu: "10"
            memory: 35Gi
        securityContext: {}
        startupProbe:
          failureThreshold: 30
          httpGet:
            path: /
            port: health-port
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/flink/conf
          name: flink-operator-config-volume
        - mountPath: /opt/scheduler/keytab
          name: flink-operator-keytab-volume
        - mountPath: /flink-data
          name: flink-operator-logs-volume
      - command:
        - /docker-entrypoint.sh
        - webhook
        env:
        - name: WEBHOOK_KEYSTORE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: flink-operator-webhook-secret
        - name: WEBHOOK_KEYSTORE_FILE
          value: /certs/keystore.p12
        - name: WEBHOOK_KEYSTORE_TYPE
          value: pkcs12
        - name: WEBHOOK_SERVER_PORT
          value: "9443"
        - name: LOG_CONFIG
          value: -Dlog4j.configurationFile=/opt/flink/conf/log4j-operator.properties
        - name: JVM_ARGS
        - name: FLINK_CONF_DIR
          value: /opt/flink/conf
        - name: FLINK_PLUGINS_DIR
          value: /opt/flink/plugins
        - name: OPERATOR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: ncr.nie.netease.com/v1-gdcstreaming/gdc-flink-kubernetes-operator:1.6.1-GDC1.0.2
        imagePullPolicy: IfNotPresent
        name: flink-webhook
        resources: {}
        securityContext: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /certs
          name: keystore
          readOnly: true
        - mountPath: /opt/flink/conf
          name: flink-operator-config-volume
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: ncr-pull-secret
      nodeSelector:
        node-role.kubernetes.io/edge: ""
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsGroup: 0
        runAsUser: 0
      serviceAccount: flink-operator
      serviceAccountName: flink-operator
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-operator.properties
            path: log4j-operator.properties
          - key: log4j-console.properties
            path: log4j-console.properties
          name: flink-operator-config
        name: flink-operator-config-volume
      - hostPath:
          path: /cfs/flink/keytab
          type: Directory
        name: flink-operator-keytab-volume
      - hostPath:
          path: /home/k8s/logs
          type: DirectoryOrCreate
        name: flink-operator-logs-volume
      - name: keystore
        secret:
          defaultMode: 420
          items:
          - key: keystore.p12
            path: keystore.p12
          secretName: webhook-server-cert
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2024-03-13T02:55:09Z"
    lastUpdateTime: "2024-03-19T06:48:09Z"
    message: ReplicaSet "flink-kubernetes-operator-7756945f7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-04-19T04:08:21Z"
    lastUpdateTime: "2024-04-19T04:08:21Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 3
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2
{code}
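
As a rough sanity check on the spec above: the operator container runs with -Xmx32g against a 35Gi memory limit, which leaves about 3 GiB for everything the JVM allocates outside the heap (metaspace, thread stacks, direct buffers, GC bookkeeping and so on). A minimal sketch of that arithmetic follows. The class and method names are hypothetical, and the parser only understands the two suffixes that appear in this spec.

```java
public class MemoryHeadroom {
    private static final long GIB = 1024L * 1024L * 1024L;

    // Parses sizes such as "32g" (JVM -Xmx style) or "35Gi" (Kubernetes style).
    // Only gibibyte suffixes are handled in this sketch.
    public static long parseGiB(String size) {
        String s = size.trim().toLowerCase();
        if (s.endsWith("gi")) {
            s = s.substring(0, s.length() - 2);
        } else if (s.endsWith("g")) {
            s = s.substring(0, s.length() - 1);
        } else {
            throw new IllegalArgumentException("unsupported size: " + size);
        }
        return Long.parseLong(s) * GIB;
    }

    // Bytes left in the container for non-heap JVM memory and other native use.
    public static long headroomBytes(String containerLimit, String maxHeap) {
        return parseGiB(containerLimit) - parseGiB(maxHeap);
    }

    public static void main(String[] args) {
        long headroom = headroomBytes("35Gi", "32g");
        System.out.println("non-heap headroom in GiB: " + headroom / GIB);
    }
}
```

With a margin this small, slow growth in memory outside the heap could push RSS to the container limit while the heap metrics still look normal.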
I also tried analyzing the heap memory using MAT. Here are the results of the analysis:

!screenshot-2.png!

 

It seems to point to a bug in the JDK [deleteOnExit API|https://bugs.openjdk.org/browse/JDK-4513817]. However, that bug should show up as heap exhaustion, whereas the JVM memory metrics show that plenty of heap was still available when the process exited abnormally. This is strange.
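
For context, here is a minimal sketch of the pattern behind that JDK report. Every path passed to File.deleteOnExit is remembered by the JVM until it exits, and there is no API to unregister it, so a long-running process that keeps creating temp files this way keeps growing that list on the heap. The class and method names are hypothetical and are not taken from the operator's code. The second method shows the usual alternative of deleting the file explicitly.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileCleanup {

    // Accumulating pattern: each call registers the path with the JVM's
    // delete-on-exit hook, and the entry stays registered until the JVM exits.
    public static File createWithDeleteOnExit() throws IOException {
        File f = File.createTempFile("sketch-", ".tmp");
        f.deleteOnExit();
        return f;
    }

    // Alternative: delete the file explicitly once it is no longer needed,
    // so nothing is registered with the hook. Returns whether the file
    // still exists after cleanup.
    public static boolean createAndDeleteExplicitly() throws IOException {
        Path p = Files.createTempFile("sketch-", ".tmp");
        try {
            Files.write(p, new byte[] {1, 2, 3});
        } finally {
            Files.deleteIfExists(p);
        }
        return Files.exists(p);
    }

    public static void main(String[] args) throws IOException {
        File registered = createWithDeleteOnExit();
        System.out.println("exists after deleteOnExit registration: " + registered.exists());
        System.out.println("exists after explicit delete: " + createAndDeleteExplicitly());
    }
}
```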



> operator oom
> ------------
>
>                 Key: FLINK-35192
>                 URL: https://issues.apache.org/jira/browse/FLINK-35192
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>         Environment: jdk: openjdk11
> operator version: 1.6.1
>            Reporter: chenyuzhi
>            Priority: Major
>         Attachments: image-2024-04-22-15-47-49-455.png, image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png
>
>
> The Kubernetes operator Docker process was killed by the kernel because it ran out of memory (at 2024.04.03 18:16):
>  !image-2024-04-22-15-47-49-455.png! 
> Metrics:
> The pod memory (RSS) has been increasing slowly over the past 7 days:
>  !screenshot-1.png! 
> However, the JVM memory metrics of the operator do not show any obvious anomaly:
>  !image-2024-04-22-15-58-23-269.png! 
>  !image-2024-04-22-15-58-42-850.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)