Posted to user@flink.apache.org by Ragini Manjaiah <ra...@gmail.com> on 2021/09/27 06:05:02 UTC

flink job : TPS drops from 400 to 30 TPS

Hi,
I have a Flink real-time job which processes user records from a topic and
also reads data from HBase, which acts as a lookup table. If the lookup
table does not contain the required metadata, the job queries an external
DB via an API. For the first 1 to 2 hours it works fine without issues, but
then throughput drops drastically from 400 TPS down to 30 TPS. What are the
things I need to look into in such a situation? There are no exceptions
caught. How do I find the bottleneck? Can someone throw some light on this?


Thanks & Regards
Ragini Manjaiah

Re: flink job : TPS drops from 400 to 30 TPS

Posted by JING ZHANG <be...@gmail.com>.
Hi,
To check the CPU cost, there are several methods:
1. The Flink built-in metric `Status.JVM.CPU.Load` [1] (a sketch of
querying it over the REST API follows this list).
2. Run the `top` command on the machine that hosts a suspect TaskManager.
3. Use a flame graph to profile the JVM more deeply [2].
...
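
As a minimal sketch, here is one way to read that metric through Flink's
monitoring REST API. The REST address and the TaskManager id below are
placeholders you would replace with your own cluster's values:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CpuLoadProbe {
    public static void main(String[] args) throws Exception {
        // Assumed JobManager REST address (default port 8081) and a
        // TaskManager id taken from the Flink UI -- adjust to your cluster.
        String restBase = "http://localhost:8081";
        String taskManagerId = args.length > 0 ? args[0] : "<taskmanager-id>";

        // The monitoring REST API exposes per-TaskManager metrics under
        // /taskmanagers/<id>/metrics?get=<metric-name>.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(restBase + "/taskmanagers/" + taskManagerId
                        + "/metrics?get=Status.JVM.CPU.Load"))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Prints a JSON array such as [{"id":"Status.JVM.CPU.Load","value":"0.42"}].
        System.out.println(response.body());
    }
}

Polling this for every TaskManager lets you see whether CPU load climbs in
step with the TPS drop.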

Regarding the RPC response times: I'm not an expert on HBase, so I'm not
sure whether an HBase cluster exposes metrics to trace each RPC's response
time. You could also add a metric that traces the time cost of each remote
request in an extended HBase connector, as in the sketch below.
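
As a rough illustration (not the actual connector code), a lookup function
can register a Flink histogram and time each HBase get. The table name and
the surrounding function are hypothetical:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogram;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimedHBaseLookup extends RichMapFunction<String, String> {
    private transient Connection connection;
    private transient Table table;
    private transient Histogram lookupLatencyMs;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Histogram over the last 500 lookups, visible in the Flink UI and
        // metric reporters as <scope>.hbaseLookupLatencyMs.
        lookupLatencyMs = getRuntimeContext().getMetricGroup()
                .histogram("hbaseLookupLatencyMs",
                        new DescriptiveStatisticsHistogram(500));
        connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        table = connection.getTable(TableName.valueOf("lookup_table")); // assumed name
    }

    @Override
    public String map(String rowKey) throws Exception {
        long start = System.nanoTime();
        Result result = table.get(new Get(Bytes.toBytes(rowKey)));
        lookupLatencyMs.update((System.nanoTime() - start) / 1_000_000L);
        // Placeholder: extract the real metadata column(s) from `result` here.
        return result.isEmpty() ? rowKey + ":miss" : rowKey + ":hit";
    }

    @Override
    public void close() throws Exception {
        if (table != null) { table.close(); }
        if (connection != null) { connection.close(); }
    }
}

If the histogram's high percentiles grow over the first hour or two, the
HBase side (or the external API fallback) is the likely bottleneck.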

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/metrics/#cpu
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/debugging/flame_graphs/

Hope it helps.

Best,
JING ZHANG


Re: flink job : TPS drops from 400 to 30 TPS

Posted by Ragini Manjaiah <ra...@gmail.com>.
Please let me know how to check the RPC response times and the CPU cost.


Re: flink job : TPS drops from 400 to 30 TPS

Posted by JING ZHANG <be...@gmail.com>.
Hi,
Since there is not enough information yet, you could first check the back
pressure status of the job [1] and find the task that causes the back
pressure; a quick REST probe is sketched below.
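
As a minimal sketch (same HTTP pattern as the CPU probe earlier in this
thread), the back pressure of one job vertex can be sampled through the
monitoring REST API. The job and vertex ids are placeholders taken from the
Flink UI:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackPressureProbe {
    public static void main(String[] args) throws Exception {
        // Assumed REST address plus job/vertex ids from the Flink UI.
        String restBase = "http://localhost:8081";
        String jobId = "<job-id>";
        String vertexId = "<vertex-id>";

        // /jobs/<jobid>/vertices/<vertexid>/backpressure returns a JSON
        // document with a "backpressure-level" field (ok, low, or high).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(restBase + "/jobs/" + jobId
                        + "/vertices/" + vertexId + "/backpressure"))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}

A vertex reporting a high back pressure level is being slowed down by
something downstream of it.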
Then try to find out why that task processes data slowly. There are many
possible reasons, for example:
(1) Does data skew exist, i.e. do some tasks process much more input data
than the others?
(2) Is the CPU cost very high?
(3) Do the RPC responses start to slow down?
(4) If you chose async-mode lookup, the LookupJoin operator needs to
buffer some data in state (see the async lookup sketch after this list).
Which state backend do you use, and does it work fine?
...
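
For reference, here is a minimal sketch of an async lookup in the
DataStream API; the SQL LookupJoin's async mode buffers in-flight rows
analogously. The lookup body and the timeout/capacity values are
illustrative assumptions:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncMetadataLookup extends RichAsyncFunction<String, String> {

    @Override
    public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
        // Run the blocking HBase get / external API fallback on another
        // thread; the async operator buffers in-flight requests in state.
        CompletableFuture
                .supplyAsync(() -> lookupMetadata(key))
                .thenAccept(metadata -> resultFuture.complete(
                        Collections.singletonList(key + ":" + metadata)));
    }

    @Override
    public void timeout(String key, ResultFuture<String> resultFuture) {
        // Emit a marker instead of failing the whole job on one slow lookup.
        resultFuture.complete(Collections.singletonList(key + ":timeout"));
    }

    private String lookupMetadata(String key) {
        return "stub"; // placeholder for the real HBase/API lookup
    }

    // Wiring, with a 5 s timeout and at most 100 requests in flight.
    public static DataStream<String> wire(DataStream<String> input) {
        return AsyncDataStream.unorderedWait(
                input, new AsyncMetadataLookup(), 5, TimeUnit.SECONDS, 100);
    }
}

If the state backend is slow (e.g. RocksDB on slow disks), that buffered
in-flight data is exactly where throughput can degrade over time.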

Would you please provide more information about the job, for example its
back pressure status, the input data distribution, and whether the lookup
runs in async or sync mode?

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/monitoring/back_pressure/

Best,
JING ZHANG
