Posted to issues@flink.apache.org by "Trystan (Jira)" <ji...@apache.org> on 2020/05/15 22:52:00 UTC

[jira] [Comment Edited] (FLINK-16267) Flink uses more memory than taskmanager.memory.process.size in Kubernetes

    [ https://issues.apache.org/jira/browse/FLINK-16267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108708#comment-17108708 ] 

Trystan edited comment on FLINK-16267 at 5/15/20, 10:51 PM:
------------------------------------------------------------

[~czchen] did you ever resolve this? I just upgraded a job to 1.10.1, set *taskmanager.memory.process.size: 6g*, and the taskmanagers are using ~9GB. We are also using RocksDB. For context, we have several custom sinks which use ThreadPoolExecutors with a high thread count (~250).

I was very surprised to see the taskmanager able to use 9GB of memory - I thought the whole point of this feature was that this shouldn't be possible.

I have another job using the `filesystem` backend, and it very strictly respects the limit - it has a whole lot more user code, and still it doesn't exceed the memory limit.
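For reference, here is a sketch of the flink-conf.yaml keys that, as I understand it, should bound RocksDB's native memory in 1.10 (both are defaults, listed explicitly; the 6g is just our setting from above):

    taskmanager.memory.process.size: 6g
    # RocksDB is only capped when it allocates from Flink's managed memory
    # (the 1.10 default, but worth pinning explicitly):
    state.backend.rocksdb.memory.managed: true
    # Share of total Flink memory handed to managed memory (1.10 default is 0.4):
    taskmanager.memory.managed.fraction: 0.4

Possibly also relevant: ~250 threads at the JVM's default 1MB stack size is on the order of 250MB of native memory, which the memory model only covers via the jvm-overhead budget.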



> Flink uses more memory than taskmanager.memory.process.size in Kubernetes
> -------------------------------------------------------------------------
>
>                 Key: FLINK-16267
>                 URL: https://issues.apache.org/jira/browse/FLINK-16267
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.10.0
>            Reporter: ChangZhuo Chen (陳昌倬)
>            Priority: Major
>         Attachments: flink-conf_1.10.0.yaml, flink-conf_1.9.1.yaml, oomkilled_taskmanager.log
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue is from [https://stackoverflow.com/questions/60336764/flink-uses-more-memory-than-taskmanager-memory-process-size-in-kubernetes]
> h1. Description
>  * In Flink 1.10.0, we try to use `taskmanager.memory.process.size` to limit the resources used by the taskmanagers so that they are not killed by Kubernetes. However, many taskmanagers still get `OOMKilled`. The setup is described in the following sections.
>  * The taskmanager log is in attachment [^oomkilled_taskmanager.log].
> h2. Kubernetes
>  * The Kubernetes setup is the same as described in [https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html].
>  * The following is resource configuration for taskmanager deployment in Kubernetes:
> resources:
>   requests:
>     cpu: 1000m
>     memory: 4096Mi
>   limits:
>     cpu: 1000m
>     memory: 4096Mi
> h2. Flink Docker
>  * The Flink image is built from the following Dockerfile:
> FROM flink:1.10-scala_2.11
> RUN mkdir -p /opt/flink/plugins/s3 && \
>     ln -s /opt/flink/opt/flink-s3-fs-presto-1.10.0.jar /opt/flink/plugins/s3/
> RUN ln -s /opt/flink/opt/flink-metrics-prometheus-1.10.0.jar /opt/flink/lib/
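> The image is then built and pushed in the usual way, e.g. (tag is illustrative):
> docker build -t <registry>/flink:1.10.0-s3 .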
> h2. Flink Configuration
>  * The following are all memory-related configurations in `flink-conf.yaml` in 1.10.0:
> jobmanager.heap.size: 820m
> taskmanager.memory.jvm-metaspace.size: 128m
> taskmanager.memory.process.size: 4096m
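> For reference, applying the 1.10 defaults to these values gives roughly the following breakdown (a sketch; it assumes no other memory options are set):
>   total process size                    4096m
>   - JVM metaspace (set above)            128m
>   - JVM overhead (fraction 0.1)         ~410m
>   = total Flink memory                 ~3558m
>       managed memory (fraction 0.4)    ~1423m
>       network memory (fraction 0.1)     ~356m
>       framework heap + off-heap          256m
>       task heap (remainder)            ~1523m
> With managed RocksDB memory enabled, RocksDB should stay inside the ~1423m managed budget; all other native allocations must fit in metaspace plus overhead.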
>  * We use RocksDB and we don't set `state.backend.rocksdb.memory.managed` in `flink-conf.yaml`.
>  ** We use S3 as checkpoint storage.
>  * The code uses the DataStream API (a minimal sketch of the job shape follows below).
>  ** Input/output are both Kafka.
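> For illustration only, a minimal sketch of that job shape against the 1.10 APIs (topics, bootstrap servers, and the pass-through map are placeholders, not the actual code):
> import java.util.Properties
> import org.apache.flink.api.common.serialization.SimpleStringSchema
> import org.apache.flink.streaming.api.scala._
> import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}
>
> object JobSketch {
>   def main(args: Array[String]): Unit = {
>     val env = StreamExecutionEnvironment.getExecutionEnvironment
>     val props = new Properties()
>     props.setProperty("bootstrap.servers", "kafka:9092") // placeholder
>     env
>       .addSource(new FlinkKafkaConsumer[String]("input-topic", new SimpleStringSchema(), props))
>       .map(x => x) // stand-in for the real transformation
>       .addSink(new FlinkKafkaProducer[String]("output-topic", new SimpleStringSchema(), props))
>     env.execute("kafka-in-kafka-out")
>   }
> }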
> h2. Project Dependencies
>  * The following are our dependencies:
> val flinkVersion = "1.10.0"
>
> libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.2.2"
> libraryDependencies += "com.typesafe" % "config" % "1.4.0"
> libraryDependencies += "joda-time" % "joda-time" % "2.10.5"
> libraryDependencies += "org.apache.flink" %% "flink-connector-kafka" % flinkVersion
> libraryDependencies += "org.apache.flink" % "flink-metrics-dropwizard" % flinkVersion
> libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion % "provided"
> libraryDependencies += "org.apache.flink" %% "flink-statebackend-rocksdb" % flinkVersion % "provided"
> libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided"
> libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.6.7"
> libraryDependencies += "org.log4s" %% "log4s" % "1.8.2"
> libraryDependencies += "org.rogach" %% "scallop" % "3.3.1"
> h2. Previous Flink 1.9.1 Configuration
>  * The configuration we used in Flink 1.9.1 is the following. It did not result in `OOMKilled` taskmanagers.
> h3. Kubernetes
> resources:
>   requests:
>     cpu: 1200m
>     memory: 2G
>   limits:
>     cpu: 1500m
>     memory: 2G
> h3. Flink 1.9.1
> jobmanager.heap.size: 820m
> taskmanager.heap.size: 1024m
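> A rough side-by-side of the headroom (illustrative arithmetic based on the values above):
>   1.9.1:  2G container limit - 1024m taskmanager.heap.size => roughly 1G of slack for
>           RocksDB, metaspace, thread stacks, and other native allocations
>   1.10.0: 4096Mi container limit = 4096m taskmanager.memory.process.size => no slack;
>           any native memory that escapes Flink's accounting ends in OOMKilled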


