Posted to issues@flink.apache.org by "Yu Li (Jira)" <ji...@apache.org> on 2020/02/26 02:45:00 UTC
[jira] [Comment Edited] (FLINK-16267) Flink uses more memory than taskmanager.memory.process.size in Kubernetes

    [ https://issues.apache.org/jira/browse/FLINK-16267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045095#comment-17045095 ] 

Yu Li edited comment on FLINK-16267 at 2/26/20 2:44 AM:
--------------------------------------------------------

[~czchen] Thanks for the quick response. Could you set {{state.backend.rocksdb.memory.managed: false}} in your 1.10.0 yaml and check whether the issue remains? This would help us judge whether the problem lies in RocksDB memory management (introduced in 1.10.0). If it does, we will suggest how to locate the root cause. Thanks.
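
For context on what this switch affects, here is a rough sketch of how the 4096m process size from the attached 1.10.0 configuration would be split into memory components. Only the process size and the 128m metaspace come from the reported configuration. Every other number (the JVM overhead, managed, and network fractions and their bounds, and the 128m framework heap and off-heap sizes) is assumed to be the Flink 1.10 default, and the function name is hypothetical. Treat it as an estimate, not as Flink's actual calculation code.

```python
# Estimate the Flink 1.10 TaskManager memory split from the process size.
# ASSUMPTION: all keyword defaults below mirror the Flink 1.10 defaults;
# only process_mb and metaspace_mb come from the reported flink-conf.yaml.


def clamp(value, low, high):
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def estimate_tm_memory(
    process_mb,
    metaspace_mb,
    overhead_fraction=0.1,
    overhead_min_mb=192,
    overhead_max_mb=1024,
    managed_fraction=0.4,
    network_fraction=0.1,
    network_min_mb=64,
    network_max_mb=1024,
    framework_heap_mb=128,
    framework_off_heap_mb=128,
    task_off_heap_mb=0,
):
    # JVM overhead is a bounded fraction of the total process memory.
    overhead = clamp(process_mb * overhead_fraction, overhead_min_mb, overhead_max_mb)
    # "Total Flink memory" is what remains after metaspace and JVM overhead.
    total_flink = process_mb - metaspace_mb - overhead
    # Managed and network memory are fractions of the total Flink memory.
    managed = total_flink * managed_fraction
    network = clamp(total_flink * network_fraction, network_min_mb, network_max_mb)
    # Task heap is the remainder once every other component is accounted for.
    task_heap = (
        total_flink
        - managed
        - network
        - framework_heap_mb
        - framework_off_heap_mb
        - task_off_heap_mb
    )
    return {
        "jvm_overhead_mb": overhead,
        "jvm_metaspace_mb": metaspace_mb,
        "managed_mb": managed,
        "network_mb": network,
        "framework_heap_mb": framework_heap_mb,
        "framework_off_heap_mb": framework_off_heap_mb,
        "task_off_heap_mb": task_off_heap_mb,
        "task_heap_mb": task_heap,
    }


if __name__ == "__main__":
    # Values from the reported 1.10.0 configuration: 4096m process, 128m metaspace.
    for name, size in estimate_tm_memory(4096, 128).items():
        print(f"{name}: {size:.2f}")
```

Under these assumptions, roughly 1423m of the 4096m would be managed memory, which is the budget RocksDB is expected to stay within when {{state.backend.rocksdb.memory.managed}} is left at its default. Setting it to false takes RocksDB out of that budget, which is why the comparison helps narrow down the cause.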

Besides, two questions about the configuration:

# From both the K8S resource spec and the yaml configuration, we can tell that the memory set for the TM increased from 2GB to 4GB. Could you share the reason behind this? Was it because of the `OOMKilled` issue, which you tried to resolve by increasing the memory setting? Or was the job parallelism changed accordingly (reduced to half of the previous value)?
# For the 1.9.1 yaml settings, the description says {{taskmanager.heap.size}} was set to 1024m, while the attached yaml file says 2000m. Could you double check and confirm which one is accurate?

Thanks.



> Flink uses more memory than taskmanager.memory.process.size in Kubernetes
> -------------------------------------------------------------------------
>
>                 Key: FLINK-16267
>                 URL: https://issues.apache.org/jira/browse/FLINK-16267
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.10.0
>            Reporter: ChangZhuo Chen (陳昌倬)
>            Priority: Major
>         Attachments: flink-conf_1.10.0.yaml, flink-conf_1.9.1.yaml, oomkilled_taskmanager.log
>
>
> This issue is from [https://stackoverflow.com/questions/60336764/flink-uses-more-memory-than-taskmanager-memory-process-size-in-kubernetes]
> h1. Description
>  * In Flink 1.10.0, we try to use `taskmanager.memory.process.size` to limit the resources used by the taskmanagers so that they are not killed by Kubernetes. However, many taskmanagers are still `OOMKilled`. The setup is described in the following sections.
>  * The taskmanager log is in attachment [^oomkilled_taskmanager.log].
> h2. Kubernetes
>  * The Kubernetes setup is the same as described in [https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html].
>  * The following is the resource configuration for the taskmanager deployment in Kubernetes:
> {{resources:}}
>  {{  requests:}}
>  {{    cpu: 1000m}}
>  {{    memory: 4096Mi}}
>  {{  limits:}}
>  {{    cpu: 1000m}}
>  {{    memory: 4096Mi}}
> h2. Flink Docker
>  * The Flink Docker image is built from the following Dockerfile.
> {{FROM flink:1.10-scala_2.11}}
>  {{RUN mkdir -p /opt/flink/plugins/s3 && ln -s /opt/flink/opt/flink-s3-fs-presto-1.10.0.jar /opt/flink/plugins/s3/}}
>  {{RUN ln -s /opt/flink/opt/flink-metrics-prometheus-1.10.0.jar /opt/flink/lib/}}
> h2. Flink Configuration
>  * The following are all memory related configurations in `flink-conf.yaml` in 1.10.0:
> {{jobmanager.heap.size: 820m}}
>  {{taskmanager.memory.jvm-metaspace.size: 128m}}
>  {{taskmanager.memory.process.size: 4096m}}
>  * We use RocksDB and we don't set `state.backend.rocksdb.memory.managed` in `flink-conf.yaml`.
>  ** Use S3 as checkpoint storage.
>  * The code uses the DataStream API.
>  ** input/output are both Kafka.
> h2. Project Dependencies
>  * The following are our dependencies.
> {{val flinkVersion = "1.10.0"}}
>  {{libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.2.2"}}
>  {{libraryDependencies += "com.typesafe" % "config" % "1.4.0"}}
>  {{libraryDependencies += "joda-time" % "joda-time" % "2.10.5"}}
>  {{libraryDependencies += "org.apache.flink" %% "flink-connector-kafka" % flinkVersion}}
>  {{libraryDependencies += "org.apache.flink" % "flink-metrics-dropwizard" % flinkVersion}}
>  {{libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion % "provided"}}
>  {{libraryDependencies += "org.apache.flink" %% "flink-statebackend-rocksdb" % flinkVersion % "provided"}}
>  {{libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"}}
>  {{libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.6.7"}}
>  {{libraryDependencies += "org.log4s" %% "log4s" % "1.8.2"}}
>  {{libraryDependencies += "org.rogach" %% "scallop" % "3.3.1"}}
> h2. Previous Flink 1.9.1 Configuration
>  * The configuration we used in Flink 1.9.1 is the following. It did not lead to `OOMKilled`.
> h3. Kubernetes
> {{resources:}}
>  {{  requests:}}
>  {{    cpu: 1200m}}
>  {{    memory: 2G}}
>  {{  limits:}}
>  {{    cpu: 1500m}}
>  {{    memory: 2G}}
> h3. Flink 1.9.1
> {{jobmanager.heap.size: 820m}}
>  {{taskmanager.heap.size: 1024m}}


