Posted to user@flink.apache.org by "Chen, Mason" <ma...@sony.com> on 2020/08/05 00:39:20 UTC

Only One TaskManager Showing High CPU Usage

Hi all,

The issue is that only one of the two TaskManagers experiences high CPU usage. [image]

I’m running a series of performance tests processing records at 50k rps. In this setup, I have 1 JobManager (1 core, 1 GB) and 2 TaskManagers (8 cores, 8 GB). Each TaskManager has 8 task slots, and we have a simple pipeline that reads from Kafka, filters, and makes an HTTP request downstream with the async I/O function.

All operators have a parallelism of 8, except the filter (parallelism of 4) and the async I/O function (parallelism of 16). Checkpointing is not turned on.

I thought operator chaining might be causing issues in distributing the load, so I disabled chaining after the filter (before the async I/O function). However, the issue persisted, and I saw a somewhat even distribution of records both before and after this change.

One potential problem: the HTTP client is not static, so it is recreated for each parallel instance of the async I/O operator (so there are going to be a lot of executors). At the CPU peak I see 10k threads, and the count grows steadily to 40k by the end of the time period shown.
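
One way to avoid creating a client (and its executor) per parallel instance is to share a single client per JVM. The following is a minimal sketch, not the code from this job: it uses the JDK's java.net.http.HttpClient as a stand-in because the HTTP library in use is not named here, and the class name SharedHttpClient, the pool size of 16 and the 5 second connect timeout are assumptions chosen for illustration.

```java
import java.net.http.HttpClient;
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper: one HTTP client and one bounded executor per JVM.
final class SharedHttpClient {
    private static final int POOL_SIZE = 16; // assumption, tune to the workload

    // Initialization-on-demand holder: created once, on first call to get().
    private static final class Holder {
        static final ExecutorService EXECUTOR =
                Executors.newFixedThreadPool(POOL_SIZE, r -> {
                    Thread t = new Thread(r, "shared-http-client");
                    t.setDaemon(true); // do not keep the JVM alive on shutdown
                    return t;
                });
        static final HttpClient CLIENT = HttpClient.newBuilder()
                .executor(EXECUTOR)
                .connectTimeout(Duration.ofSeconds(5))
                .build();
    }

    private SharedHttpClient() {}

    static HttpClient get() {
        return Holder.CLIENT;
    }

    public static void main(String[] args) {
        // Every caller sees the same instance.
        System.out.println(get() == get());
    }
}
```

Each parallel instance would call SharedHttpClient.get() instead of constructing its own client. A static field is shared per JVM and class loader, so this bounds the number of executors per TaskManager rather than per cluster.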


Does anyone have any ideas? Of the 50k rps, only about 500 events per second need to hit the async I/O function (the filter drops the unrelated events). Everything was fine before I added the unrelated events (with just the 500 rps going to the async I/O function).

Thanks,
Mason

Re: Only One TaskManager Showing High CPU Usage

Posted by Jake <ft...@qq.com>.
Hi Mason

Can you use JVM CPU performance analysis tools?

For example, JProfiler or Arthas: https://github.com/alibaba/arthas

With these you can probably find the reason for the high CPU load.
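
Before reaching for a full profiler, a quick census of live threads by name can already show which pool is growing. This is a minimal sketch using only the JDK's management API; the class name ThreadCensus and the rule for grouping names (dropping a trailing number) are assumptions for illustration, and the code has to run inside the JVM being inspected.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts live threads grouped by name prefix.
class ThreadCensus {
    // Drops a trailing number so "pool-3-thread-12" and "pool-3-thread-7"
    // fall into the same group, "pool-3-thread".
    static String groupName(String threadName) {
        return threadName.replaceAll("[-#\\s]*\\d+$", "");
    }

    static Map<String, Integer> countByGroup() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> counts = new TreeMap<>();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            counts.merge(groupName(info.getThreadName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countByGroup().forEach((group, n) -> System.out.println(n + "\t" + group));
    }
}
```

If one group accounts for most of the thread count, that points at the component creating them.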

Jake



Re: Only One TaskManager Showing High CPU Usage

Posted by Piotr Nowojski <pn...@apache.org>.
Yes, as it is at the moment, you should be paying quite a bit of attention
to how the tasks are distributed across the cluster. Good that you managed
to solve your problem, and I hope that we will improve the Flink experience
in such cases in the future.

Piotrek

On Thu, Aug 6, 2020 at 06:25, Chen, Mason <ma...@sony.com> wrote:

> Thanks Peter for the reply. I noticed the behavior you described when I
> reduced the parallelism of the async I/O sink to 8: one TaskManager had its
> slots completely taken and the other had all of its slots open. To mitigate
> this, I tried setting `cluster.evenly-spread-out-slots: true`, but it didn’t
> fix anything (I had expected the JobManager to split the task slot
> requirements evenly between the two TaskManagers). It seems that in general
> I should be extremely wary of the parallelism and the number of task slots,
> and of their effects on CPU/memory usage…
>
> I will use your workaround and use a parallelism of 8; I can scale the
> capacity of the async I/O function accordingly, no problem there. For the
> filter function, I kept the parallelism at 4 because there’s a cache
> involved and I noticed that the hit rate was worse when the parallelism was
> higher. I will use a keyBy to mitigate this.

Re: Only One TaskManager Showing High CPU Usage

Posted by "Chen, Mason" <ma...@sony.com>.
Thanks Peter for the reply. I noticed the behavior you described when I reduced the parallelism of the async I/O sink to 8: one TaskManager had its slots completely taken and the other had all of its slots open. To mitigate this, I tried setting `cluster.evenly-spread-out-slots: true`, but it didn’t fix anything (I had expected the JobManager to split the task slot requirements evenly between the two TaskManagers). It seems that in general I should be extremely wary of the parallelism and the number of task slots, and of their effects on CPU/memory usage…
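
For reference, this is roughly what the relevant part of flink-conf.yaml looks like with that option set. It is only a sketch of the configuration as described in this thread (8 slots per TaskManager plus the spread-out option), not a complete file, and as noted it did not change the placement in this case.

```yaml
# Slots offered by each TaskManager (from the setup described in this thread)
taskmanager.numberOfTaskSlots: 8

# Ask the cluster to spread slots across the available TaskManagers
cluster.evenly-spread-out-slots: true
```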

I will use your workaround and use a parallelism of 8; I can scale the capacity of the async I/O function accordingly, no problem there. For the filter function, I kept the parallelism at 4 because there’s a cache involved and I noticed that the hit rate was worse when the parallelism was higher. I will use a keyBy to mitigate this.
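
To illustrate why routing by key should help the cache, here is a small simulation in plain Java (no Flink involved). It is a sketch under stated assumptions: each parallel instance has its own unbounded cache, events cycle through 97 distinct keys, "round robin" stands in for an unkeyed distribution, and a simple modulo of the key's hash stands in for keyBy (Flink's real key-group assignment is different). The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntBinaryOperator;

// Hypothetical simulation: cache misses under two routing strategies.
class CacheRoutingSketch {
    // router maps (eventIndex, key) to the index of the instance that
    // processes the event; each instance keeps its own cache of seen keys.
    static int countMisses(int events, int distinctKeys, int parallelism,
                           IntBinaryOperator router) {
        List<Set<Integer>> caches = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            caches.add(new HashSet<>());
        }
        int misses = 0;
        for (int i = 0; i < events; i++) {
            int key = i % distinctKeys;
            int instance = router.applyAsInt(i, key);
            // add() returns true when the key was not cached yet: a miss.
            if (caches.get(instance).add(key)) {
                misses++;
            }
        }
        return misses;
    }

    // Unkeyed: events are dealt to instances in turn, regardless of key.
    static int roundRobinMisses(int events, int distinctKeys, int parallelism) {
        return countMisses(events, distinctKeys, parallelism,
                (i, key) -> i % parallelism);
    }

    // Keyed: the same key always goes to the same instance.
    static int keyedMisses(int events, int distinctKeys, int parallelism) {
        return countMisses(events, distinctKeys, parallelism,
                (i, key) -> Math.floorMod(Integer.hashCode(key), parallelism));
    }

    public static void main(String[] args) {
        System.out.println(roundRobinMisses(7760, 97, 8)); // prints 776
        System.out.println(keyedMisses(7760, 97, 8));      // prints 97
    }
}
```

With round-robin routing every instance has to load every key once (97 keys x 8 instances = 776 misses), while keyed routing loads each key exactly once, so the miss count no longer grows with the parallelism.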

From: Piotr Nowojski <pi...@gmail.com>
Date: Wednesday, August 5, 2020 at 10:36 AM
To: "Chen, Mason" <ma...@sony.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: Only One TaskManager Showing High CPU Usage


Re: Only One TaskManager Showing High CPU Usage

Posted by Piotr Nowojski <pi...@gmail.com>.
Hi,

My guess is that, since you have 16 slots in total (8 slots per TM) while
your operators have various levels of parallelism (8, 4, 16), Flink is
scheduling all of the operators with parallelism < 16 on the TM that becomes
available to the scheduler first. That's causing the visible load skew. Keep
in mind that different operators are by default allowed to share the same
task slot, unless you explicitly tell them not to do that [1].

One obvious workaround would be to define the same parallelism for all of
the operators, and that's the usual way to go, unless you have a really
good reason not to. Can you try this out? Usually there is no harm in
keeping more operator instances than required, and in your case you already
have the highest parallelism in your Async function (the one that allocates
the most resources?).
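
The skew described above can be reproduced with a toy model. This is a deliberately simplified sketch of the guess in this message, not Flink's actual scheduler: it assumes slot sharing puts subtask i of every operator into slot i, and that slots are numbered TaskManager by TaskManager. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical model: how many subtasks end up on each TaskManager.
class SlotSkewSketch {
    static int[] subtasksPerTaskManager(int taskManagers, int slotsPerTm,
                                        int... parallelisms) {
        int totalSlots = taskManagers * slotsPerTm;
        int[] perTm = new int[taskManagers];
        for (int p : parallelisms) {
            if (p > totalSlots) {
                throw new IllegalArgumentException("parallelism exceeds slots: " + p);
            }
            for (int subtask = 0; subtask < p; subtask++) {
                // Subtask i shares slot i with subtask i of the other
                // operators; slots 0..slotsPerTm-1 belong to the first TM.
                perTm[subtask / slotsPerTm]++;
            }
        }
        return perTm;
    }

    public static void main(String[] args) {
        // Mixed parallelism (8, 4, 16) on 2 TMs with 8 slots each.
        System.out.println(Arrays.toString(subtasksPerTaskManager(2, 8, 8, 4, 16))); // [20, 8]
        // Same parallelism (16) for all three operators.
        System.out.println(Arrays.toString(subtasksPerTaskManager(2, 8, 16, 16, 16))); // [24, 24]
    }
}
```

In this model the first TaskManager runs 20 subtasks and the second only 8 when parallelism is mixed, while a uniform parallelism of 16 gives 24 on each.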

Till, is there a way to change this resource allocation/scheduling
behaviour? To not pack everything on the same TM?

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/operators/#task-chaining-and-resource-groups


