Posted to user-zh@flink.apache.org by Eleanore Jin <el...@gmail.com> on 2020/04/29 23:40:53 UTC

Flink Task Manager GC overhead limit exceeded

Hi All,

Currently I am running a Flink job cluster (v1.8.2) on Kubernetes with 4
pods, each pod with a parallelism of 4.

The Flink job reads from a source topic with 96 partitions and applies a
per-element filter; the filter criteria come from a broadcast topic, always
using the latest message, and the results are published to a sink topic.

There is no checkpointing or state involved.

Then I started seeing "GC overhead limit exceeded" errors continuously, and
the pods keep restarting.

So I tried to increase the heap size for the task manager with:

containers:
  - args:
    - task-manager
    - -Djobmanager.rpc.address=service-job-manager
    - -Dtaskmanager.heap.size=4096m
    - -Denv.java.opts.taskmanager="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"


Three things I noticed:


1. The heap size shown in the UI for the task manager does not look correct.

[image: image.png]

2. I don't see the heap dump file /dumps/oom.bin in the restarted pod. Did I
set the java opts wrong?

3. I continuously see the log below from all pods, and I am not sure whether
it causes any issue:
{"@timestamp":"2020-04-29T23:39:43.387Z","@version":"1","message":"[Consumer
clientId=consumer-1, groupId=aba774bc] Node 6 was unable to process the
fetch request with (sessionId=2054451921, epoch=474):
FETCH_SESSION_ID_NOT_FOUND.","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"pool-6-thread-1","level":"INFO","level_value":20000}

Thanks a lot for any help!

Best,
Eleanore

Re: Flink Task Manager GC overhead limit exceeded

Posted by Xintong Song <to...@gmail.com>.
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/mem_setup.html

Thank you~

Xintong Song



On Fri, May 1, 2020 at 8:35 AM shao.hongxiao <17...@163.com> wrote:

> Hi Song,
> "Please refer to this document [1] for more details"
> Could you send the specific link? I have also noticed that the memory
> figures shown on the Flink UI don't look right, and I would like to read
> the related documentation carefully.
>
> Thanks a lot
>
> Shao Hongxiao
> Email: 17611022895@163.com
>

Re: Flink Task Manager GC overhead limit exceeded

Posted by "shao.hongxiao" <17...@163.com>.
Hi Song,
"Please refer to this document [1] for more details"
Could you send the specific link? I have also noticed that the memory figures shown on the Flink UI don't look right, and I would like to read the related documentation carefully.

Thanks a lot

Shao Hongxiao
Email: 17611022895@163.com

On 04/30/2020 12:08, Xintong Song wrote:
Then I would suggest the following.
- Check the task manager log to see if the '-D' properties are properly loaded. They should be located at the beginning of the log file.
- You can also try to log into the pod and check the JVM launch command with "ps -ef | grep TaskManagerRunner". I suspect there might be some argument passing problem regarding the spaces and double quotation marks.





Thank you~

Xintong Song






Re: Flink Task Manager GC overhead limit exceeded

Posted by Xintong Song <to...@gmail.com>.
Then I would suggest the following.
- Check the task manager log to see if the '-D' properties are properly
loaded. They should be located at the beginning of the log file.
- You can also try to log into the pod and check the JVM launch command
with "ps -ef | grep TaskManagerRunner". I suspect there might be some
argument passing problem regarding the spaces and double quotation marks.
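As an illustration only (a hedged sketch, not verified against your image's
entrypoint script): Kubernetes passes each entry of 'args' to the process as a
single argument without any shell interpretation, so the inner double quotes
are not needed and may end up inside the property value. One variant to try:

containers:
  - args:
    - task-manager
    - -Djobmanager.rpc.address=service-job-manager
    - -Dtaskmanager.heap.size=4096m
    # hypothetical variant without the inner quotes; whether the space inside
    # the value survives the image's startup script is exactly what "ps -ef" would show
    - -Denv.java.opts.taskmanager=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin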


Thank you~

Xintong Song



On Thu, Apr 30, 2020 at 11:39 AM Eleanore Jin <el...@gmail.com>
wrote:

> Hi Xintong,
>
> Thanks for the detailed explanation!
>
> As for the 2nd question: I mounted it to an emptyDir, and I assume a pod
> restart will not cause the pod to be rescheduled to another node, so the
> file should stay? I verified this by adding the following directly to
> flink-conf.yaml, in which case I do see the heap dump taken and kept in the
> directory:  env.java.opts:
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps
>
> In addition, I also don't see the log print something like "Heap dump
> file created [5220997112 bytes in 73.464 secs]", which I do see when adding
> the options directly in flink-conf.yaml.
>
> containers:
>
> - volumeMounts:
>
>         - mountPath: /dumps
>
>           name: heap-dumps
>
> volumes:
>
>       - emptyDir: {}
>
>         name: heap-dumps
>
>
> Thanks a lot!
>
> Eleanore

Re: Flink Task Manager GC overhead limit exceeded

Posted by Xintong Song <to...@gmail.com>.
Hi Eleanore,

I'd like to explain 1 & 2. For 3, I have no idea either.

1. The heap size shown in the UI for the task manager does not look correct.
>

Despite the 'heap' in the key, 'taskmanager.heap.size' accounts for the
total memory of a Flink task manager, rather than only the heap memory. A
Flink task manager process consumes not only java heap memory, but also
direct memory (e.g., network buffers) and native memory (e.g., JVM
overhead). That's why the JVM heap size shown on the UI is much smaller
than the configured 'taskmanager.heap.size'. Please refer to this document
[1] for more details. This document comes from Flink 1.9 and has not been
back-ported to 1.8, but the contents should apply to 1.8 as well.
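As a rough illustration (a hedged sketch: the keys below exist in Flink 1.8,
but the values shown are only the documented defaults, so please double-check
them against the linked page), these are the settings that carve non-heap
memory out of the configured total:

# flink-conf.yaml, illustrative only
taskmanager.heap.size: 4096m               # total task manager memory, not just JVM heap
taskmanager.network.memory.fraction: 0.1   # network buffers taken from the total
taskmanager.network.memory.min: 64mb
taskmanager.network.memory.max: 1gb
taskmanager.memory.off-heap: false         # managed memory stays on-heap unless true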

2. I don't see the heap dump file /dumps/oom.bin in the restarted pod. Did I
> set the java opts wrong?
>

The java options look good to me. Is the configured path '/dumps/oom.bin' a
path local to the pod, or a path on the host mounted into the pod? The
restarted pod is a completely new pod. Everything written to the old pod goes
away when that pod terminates, unless it is written to the host through
mounted storage.
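For example, a minimal sketch of keeping /dumps on the host (hypothetical
names and host path; a PersistentVolumeClaim would serve the same purpose), so
the dump survives the pod being replaced:

containers:
  - volumeMounts:
      - mountPath: /dumps
        name: heap-dumps
volumes:
  - name: heap-dumps
    hostPath:
      path: /var/flink-heap-dumps   # hypothetical directory on the node
      type: DirectoryOrCreate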

Thank you~

Xintong Song


[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/mem_setup.html
