Posted to hdfs-dev@hadoop.apache.org by Fengnan Li <lo...@gmail.com> on 2020/10/30 20:50:48 UTC

Cost Based FairCallQueue latency issue

Hi all,

 

Has anyone deployed the Cost Based Fair Call Queue in their production cluster? We ran into some RPC queue latency degradation at ~30k-40k rps. I tried to debug but didn’t find anything suspicious. It is worth mentioning that there is no memory issue from the extra heap usage for storing the call cost.

 

Thanks,

Fengnan


Re: Cost Based FairCallQueue latency issue

Posted by Fengnan Li <lo...@gmail.com>.
Thanks for replying, Chen! There is a lot of context about this, so it is probably better to set up a meeting about it. Do you have time this week? I am interested to know in what circumstances you ran into the queue latency issue.

 

Some more context from my side:

I did the debugging below to figure it out:
Double-checked RPC processing time, but didn’t find an obvious increase.
Did some flame graph profiling, but didn’t catch anything obvious in the cost-based code paths.
Replayed with Dynamometer, but was not able to find a clear increase in that environment.
 

Thanks,

Fengnan

 

From: Chen Liang <va...@gmail.com>
Date: Wednesday, November 4, 2020 at 12:08 PM
To: Fengnan Li <lo...@gmail.com>
Cc: Hdfs-dev <hd...@hadoop.apache.org>
Subject: Re: Cost Based FairCallQueue latency issue

 

Hi Fengnan,

 

We have been testing the cost-based fair call queue internally. We also saw a latency increase, and we are trying to debug this issue as well. Our current suspicion is that the way the metrics are generated might be introducing too much overhead. We are in the process of trying to reproduce this using Dynamometer. If this is something you would be interested in, we can follow up on working together on this issue.

 

Best,

Chen

 

Fengnan Li <lo...@gmail.com> 于2020年10月30日周五 下午1:51写道:

Hi all,



Has anyone deployed the Cost Based Fair Call Queue in their production cluster? We ran into some RPC queue latency degradation at ~30k-40k rps. I tried to debug but didn’t find anything suspicious. It is worth mentioning that there is no memory issue from the extra heap usage for storing the call cost.



Thanks,

Fengnan


Re: Cost Based FairCallQueue latency issue

Posted by Chen Liang <va...@gmail.com>.
Hi Fengnan,

We have been testing the cost-based fair call queue internally. We also saw
a latency increase, and we are trying to debug this issue as well.
Our current suspicion is that the way the metrics are generated might be
introducing too much overhead. We are in the process of trying to reproduce
this using Dynamometer. If this is something you would be interested in, we
can follow up on working together on this issue.

Best,
Chen

Fengnan Li <lo...@gmail.com> 于2020年10月30日周五 下午1:51写道:

> Hi all,
>
>
>
> Has anyone deployed the Cost Based Fair Call Queue in their production
> cluster? We ran into some RPC queue latency degradation at ~30k-40k rps.
> I tried to debug but didn’t find anything suspicious. It is worth
> mentioning that there is no memory issue from the extra heap usage for
> storing the call cost.
>
>
>
> Thanks,
>
> Fengnan
>
>

Re: [E] Cost Based FairCallQueue latency issue

Posted by Fengnan Li <lo...@gmail.com>.
Thanks Daryn,

 

0.01 is just an initial config, and it will not exert a penalty on heavy users. We are doing this just to have the code evaluated without actually using the feature.

The blacklist feature is another step in this direction, meaning heavy users won’t have their calls placed in the second queue.

With the above two, I am evaluating the qtime, since it is basically 99% of the call queue size and 100% of the handler resources compared to a single simple queue. That’s why I don’t understand the qtime diff.

I have heard the lock time metrics might be one issue; did you notice that call taking long?

 

Fengnan

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 10:58 AM
To: Fengnan Li <lo...@gmail.com>
Cc: Hdfs-dev <hd...@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

We're internally running the patch I submitted on HDFS-14403, which was subsequently modified by other people in the community, so it's possible the community flavor may behave differently.  I vaguely remember the RpcMetrics time unit was changed from micros to millis.  Measuring in millis has meaningless precision.
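To illustrate the precision point: RPC queue times at these rates are often tens to hundreds of microseconds, and recording them in whole milliseconds collapses most samples to zero. A small sketch with hypothetical values (not Hadoop code):

```python
# Hypothetical queue-time samples in microseconds (made-up values).
qtimes_us = [120, 340, 75, 980, 1500]

# Measured in microseconds the spread is visible...
avg_us = sum(qtimes_us) / len(qtimes_us)

# ...but truncated to whole milliseconds most samples collapse to 0,
# and the metric loses almost all of its information.
qtimes_ms = [us // 1000 for us in qtimes_us]

print(avg_us)      # average queue time in microseconds
print(qtimes_ms)   # the same samples at millisecond granularity
```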

 

WeightedTimeCostProvider is what enables the feature.  The blacklist is a different feature, so if twiddling that conf caused noticeable latency differences then I'd suggest examining that change.

 

I don't think you are going to see much benefit from 2 queues with a .01 decay factor.  I'd suggest at least 4 queues with 0.5 decay so users generating heavy load don't keep popping back up in priority so quickly.
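As a concrete sketch of that suggestion, the relevant properties might look like the following; the port (8020), thresholds, and weights below are illustrative values only, not tested recommendations:

```xml
<!-- Hypothetical sketch of the suggestion above: 4 priority levels with a
     0.5 decay factor so heavy users fall back to high priority more slowly.
     Thresholds and multiplexer weights shown are the common defaults for
     4 levels; tune for your workload. -->
<property>
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>4</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.decay-factor</name>
  <value>0.5</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.thresholds</name>
  <value>13,25,50</value>
</property>
<property>
  <name>ipc.8020.faircallqueue.multiplexer.weights</name>
  <value>8,4,2,1</value>
</property>
```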

 

 

 

On Thu, Nov 5, 2020 at 11:43 AM Fengnan Li <lo...@gmail.com> wrote:

Thanks for the response Daryn!

 

I agree with you that the overall average qtime will increase due to the penalty FCQ brings to the heavy users. However, in our environment, out of the same consideration I intentionally turned off the call selection between queues, i.e. the cost is calculated as usual, but all users stay in the first queue. This is to avoid the overall impact.

Here are our configs; the red one is what I added for internal use to turn on this feature (so that only selected users are actually added into the second queue when their cost reaches the threshold).

 

There are two patches for Cost Based FCQ. https://issues.apache.org/jira/browse/HADOOP-16266 and https://issues.apache.org/jira/browse/HDFS-14667. Which version are you using? 

I am now trying to debug them one by one.

 

Thanks,
Fengnan

 

<property>

    <name>ipc.8020.callqueue.capacity.weights</name>

    <value>99,1</value>

  </property>

  <property>

    <name>ipc.8020.callqueue.impl</name>

    <value>org.apache.hadoop.ipc.FairCallQueue</value>

  </property>

  <property>

    <name>ipc.8020.cost-provider.impl</name>

    <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>

    <value>true</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.decay-factor</name>

    <value>0.01</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.period-ms</name>

    <value>20000</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.thresholds</name>

    <value>15</value>

  </property>

  <property>

    <name>ipc.8020.faircallqueue.multiplexer.weights</name>

    <value>99,1</value>

  </property>

  <property>

    <name>ipc.8020.scheduler.priority.levels</name>

    <value>2</value>

  </property>

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 9:19 AM
To: Fengnan Li <lo...@gmail.com>
Cc: Hdfs-dev <hd...@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

I submitted the original 2.8 cost-based FCQ patch (thanks to community members for porting to other branches).  We've been running with it since early 2019 on all clusters.  Multiple clusters run at a baseline of ~30k+ ops/sec with some bursting over 100k ops/sec.  

 

If you are looking at the overall average qtime, yes, that metric is expected to increase and means it's working as designed.  De-prioritizing write heavy users will naturally result in increased qtime for those calls.  Within a bucket, call N's qtime is the sum of the qtime+processing for the prior 0..N-1 calls.  This will appear very high for congested low priority buckets receiving a fraction of the service rate and skew the overall average.
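The skew can be seen with a back-of-the-envelope simulation (made-up numbers, not Hadoop code): a congested low-priority bucket served at a small fraction of handler capacity accumulates qtimes that dominate the overall average even when it holds a minority of the calls:

```python
# Back-of-the-envelope model: within one bucket, call N's queue time is the
# summed service time of the calls ahead of it, scaled by the fraction of
# handler capacity that bucket receives.

def bucket_qtimes(n_calls, proc_ms, service_share):
    """Queue time of each queued call in a bucket that gets only
    `service_share` of the handlers' attention."""
    effective_ms = proc_ms / service_share  # time to drain one call
    return [i * effective_ms for i in range(n_calls)]

# High-priority bucket: 900 queued calls, 99% of handler capacity.
hi = bucket_qtimes(900, proc_ms=1.0, service_share=0.99)
# Congested low-priority bucket: 100 queued calls, only 1% of capacity.
lo = bucket_qtimes(100, proc_ms=1.0, service_share=0.01)

avg_hi = sum(hi) / len(hi)
avg_lo = sum(lo) / len(lo)
avg_all = (sum(hi) + sum(lo)) / (len(hi) + len(lo))

print(f"high-priority bucket avg qtime: {avg_hi:.0f} ms")
print(f"low-priority bucket avg qtime:  {avg_lo:.0f} ms")
print(f"overall average qtime:          {avg_all:.0f} ms")
```

With these made-up numbers, the 10% of calls in the starved bucket roughly double the overall average, even though the high-priority bucket's service barely changed.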

 

 

On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <lo...@gmail.com> wrote:

Hi all,



Has anyone deployed the Cost Based Fair Call Queue in their production cluster? We ran into some RPC queue latency degradation at ~30k-40k rps. I tried to debug but didn’t find anything suspicious. It is worth mentioning that there is no memory issue from the extra heap usage for storing the call cost.



Thanks,

Fengnan


Re: [E] Cost Based FairCallQueue latency issue

Posted by Fengnan Li <lo...@gmail.com>.
Hi Daryn,

 

A slightly related question: have you used -refreshCallQueue to tune the config for the fair call queue instead of the normal maintenance (failover + restart)? If so, what is the performance impact?

 

Thanks,
Fengnan

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 10:58 AM
To: Fengnan Li <lo...@gmail.com>
Cc: Hdfs-dev <hd...@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

We're internally running the patch I submitted on HDFS-14403, which was subsequently modified by other people in the community, so it's possible the community flavor may behave differently.  I vaguely remember the RpcMetrics time unit was changed from micros to millis.  Measuring in millis has meaningless precision.

 

WeightedTimeCostProvider is what enables the feature.  The blacklist is a different feature, so if twiddling that conf caused noticeable latency differences then I'd suggest examining that change.

 

I don't think you are going to see much benefit from 2 queues with a .01 decay factor.  I'd suggest at least 4 queues with 0.5 decay so users generating heavy load don't keep popping back up in priority so quickly.

 

 

 

On Thu, Nov 5, 2020 at 11:43 AM Fengnan Li <lo...@gmail.com> wrote:

Thanks for the response Daryn!

 

I agree with you that the overall average qtime will increase due to the penalty FCQ brings to the heavy users. However, in our environment, out of the same consideration I intentionally turned off the call selection between queues, i.e. the cost is calculated as usual, but all users stay in the first queue. This is to avoid the overall impact.

Here are our configs; the red one is what I added for internal use to turn on this feature (so that only selected users are actually added into the second queue when their cost reaches the threshold).

 

There are two patches for Cost Based FCQ. https://issues.apache.org/jira/browse/HADOOP-16266 and https://issues.apache.org/jira/browse/HDFS-14667. Which version are you using? 

I am now trying to debug them one by one.

 

Thanks,
Fengnan

 

<property>

    <name>ipc.8020.callqueue.capacity.weights</name>

    <value>99,1</value>

  </property>

  <property>

    <name>ipc.8020.callqueue.impl</name>

    <value>org.apache.hadoop.ipc.FairCallQueue</value>

  </property>

  <property>

    <name>ipc.8020.cost-provider.impl</name>

    <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>

    <value>true</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.decay-factor</name>

    <value>0.01</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.period-ms</name>

    <value>20000</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.thresholds</name>

    <value>15</value>

  </property>

  <property>

    <name>ipc.8020.faircallqueue.multiplexer.weights</name>

    <value>99,1</value>

  </property>

  <property>

    <name>ipc.8020.scheduler.priority.levels</name>

    <value>2</value>

  </property>

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 9:19 AM
To: Fengnan Li <lo...@gmail.com>
Cc: Hdfs-dev <hd...@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

I submitted the original 2.8 cost-based FCQ patch (thanks to community members for porting to other branches).  We've been running with it since early 2019 on all clusters.  Multiple clusters run at a baseline of ~30k+ ops/sec with some bursting over 100k ops/sec.  

 

If you are looking at the overall average qtime, yes, that metric is expected to increase and means it's working as designed.  De-prioritizing write heavy users will naturally result in increased qtime for those calls.  Within a bucket, call N's qtime is the sum of the qtime+processing for the prior 0..N-1 calls.  This will appear very high for congested low priority buckets receiving a fraction of the service rate and skew the overall average.

 

 

On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <lo...@gmail.com> wrote:

Hi all,



Has anyone deployed the Cost Based Fair Call Queue in their production cluster? We ran into some RPC queue latency degradation at ~30k-40k rps. I tried to debug but didn’t find anything suspicious. It is worth mentioning that there is no memory issue from the extra heap usage for storing the call cost.



Thanks,

Fengnan


Re: [E] Cost Based FairCallQueue latency issue

Posted by Daryn Sharp <da...@verizonmedia.com.INVALID>.
We're internally running the patch I submitted on HDFS-14403, which was
subsequently modified by other people in the community, so it's possible the
community flavor may behave differently.  I vaguely remember the RpcMetrics
time unit was changed from micros to millis.  Measuring in millis has
meaningless precision.

WeightedTimeCostProvider is what enables the feature.  The blacklist is a
different feature, so if twiddling that conf caused noticeable latency
differences then I'd suggest examining that change.

I don't think you are going to see much benefit from 2 queues with a .01
decay factor.  I'd suggest at least 4 queues with 0.5 decay so users
generating heavy load don't keep popping back up in priority so quickly.
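For intuition on the decay factor: each sweep period the scheduler multiplies every user's accumulated cost by the factor, so a small factor like 0.01 wipes out nearly all history every period and a heavy user quickly regains top priority. A rough sketch (simplified model, not the actual DecayRpcScheduler code):

```python
# Simplified model of decay: each sweep period (e.g. every 20s) the
# scheduler multiplies every user's accumulated cost by the decay factor;
# new cost from the period is then added on top.

def decayed_cost(initial_cost, decay_factor, periods):
    """Cost remaining after `periods` sweeps with no new activity."""
    cost = initial_cost
    for _ in range(periods):
        cost *= decay_factor
    return cost

heavy = 1_000_000  # a heavy user's accumulated cost, arbitrary units

# With decay-factor 0.01, 99% of the history is gone after a single sweep,
# so the user is almost immediately scheduled at high priority again.
fast = decayed_cost(heavy, 0.01, periods=1)

# With decay-factor 0.5, only half the history decays per sweep, so the
# user stays de-prioritized for several periods.
slow = decayed_cost(heavy, 0.5, periods=1)

print(f"after 1 sweep: factor 0.01 -> {fast:,.0f}, factor 0.5 -> {slow:,.0f}")
```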



On Thu, Nov 5, 2020 at 11:43 AM Fengnan Li <lo...@gmail.com> wrote:

> Thanks for the response Daryn!
>
>
>
> I agree with you that the overall average qtime will increase due
> to the penalty FCQ brings to the heavy users. However, in our environment,
> out of the same consideration I intentionally turned off the call selection
> between queues, i.e. the cost is calculated as usual, but all users
> stay in the first queue. This is to avoid the overall impact.
>
> Here are our configs; the red one is what I added for internal use to turn
> on this feature (so that only selected users are actually added into the
> second queue when their cost reaches the threshold).
>
>
>
> There are two patches for Cost Based FCQ.
> https://issues.apache.org/jira/browse/HADOOP-16266
> and https://issues.apache.org/jira/browse/HDFS-14667.
> Which version are you using?
>
> I am right now trying to debug one by one.
>
>
>
> Thanks,
> Fengnan
>
>
>
> <property>
>
>     <name>ipc.8020.callqueue.capacity.weights</name>
>
>     <value>99,1</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.callqueue.impl</name>
>
>     <value>org.apache.hadoop.ipc.FairCallQueue</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.cost-provider.impl</name>
>
>     <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>
>
>     <value>true</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.decay-scheduler.decay-factor</name>
>
>     <value>0.01</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.decay-scheduler.period-ms</name>
>
>     <value>20000</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.decay-scheduler.thresholds</name>
>
>     <value>15</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.faircallqueue.multiplexer.weights</name>
>
>     <value>99,1</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.scheduler.priority.levels</name>
>
>     <value>2</value>
>
>   </property>
>
>
>
> *From: *Daryn Sharp <da...@verizonmedia.com>
> *Date: *Thursday, November 5, 2020 at 9:19 AM
> *To: *Fengnan Li <lo...@gmail.com>
> *Cc: *Hdfs-dev <hd...@hadoop.apache.org>
> *Subject: *Re: [E] Cost Based FairCallQueue latency issue
>
>
>
> I submitted the original 2.8 cost-based FCQ patch (thanks to community
> members for porting to other branches).  We've been running with it since
> early 2019 on all clusters.  Multiple clusters run at a baseline of ~30k+
> ops/sec with some bursting over 100k ops/sec.
>
>
>
> If you are looking at the overall average qtime, yes, that metric is
> expected to increase and means it's working as designed.  De-prioritizing
> write heavy users will naturally result in increased qtime for those
> calls.  Within a bucket, call N's qtime is the sum of the qtime+processing
> for the prior 0..N-1 calls.  This will appear very high for congested low
> priority buckets receiving a fraction of the service rate and skew the
> overall average.
>
>
>
>
>
> On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <lo...@gmail.com> wrote:
>
> Hi all,
>
>
>
> Has anyone deployed the Cost Based Fair Call Queue in their production
> cluster? We ran into some RPC queue latency degradation at ~30k-40k rps.
> I tried to debug but didn’t find anything suspicious. It is worth
> mentioning that there is no memory issue from the extra heap usage for
> storing the call cost.
>
>
>
> Thanks,
>
> Fengnan
>
>

Re: [E] Cost Based FairCallQueue latency issue

Posted by Jim Brennan <ja...@verizonmedia.com.INVALID>.
Note that I have a Jira up right now for a bug that Daryn found while
testing FCQ internally. Not sure if it is relevant to what you are seeing.
https://issues.apache.org/jira/browse/HADOOP-17342

Jim

On Thu, Nov 5, 2020 at 11:43 AM Fengnan Li <lo...@gmail.com> wrote:

> Thanks for the response Daryn!
>
>
>
> I agree with you that the overall average qtime will increase due
> to the penalty FCQ brings to the heavy users. However, in our environment,
> out of the same consideration I intentionally turned off the call selection
> between queues, i.e. the cost is calculated as usual, but all users
> stay in the first queue. This is to avoid the overall impact.
>
> Here are our configs; the red one is what I added for internal use to turn
> on this feature (so that only selected users are actually added into the
> second queue when their cost reaches the threshold).
>
>
>
> There are two patches for Cost Based FCQ.
> https://issues.apache.org/jira/browse/HADOOP-16266
> and https://issues.apache.org/jira/browse/HDFS-14667.
> Which version are you using?
>
> I am right now trying to debug one by one.
>
>
>
> Thanks,
> Fengnan
>
>
>
> <property>
>
>     <name>ipc.8020.callqueue.capacity.weights</name>
>
>     <value>99,1</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.callqueue.impl</name>
>
>     <value>org.apache.hadoop.ipc.FairCallQueue</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.cost-provider.impl</name>
>
>     <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>
>
>     <value>true</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.decay-scheduler.decay-factor</name>
>
>     <value>0.01</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.decay-scheduler.period-ms</name>
>
>     <value>20000</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.decay-scheduler.thresholds</name>
>
>     <value>15</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.faircallqueue.multiplexer.weights</name>
>
>     <value>99,1</value>
>
>   </property>
>
>   <property>
>
>     <name>ipc.8020.scheduler.priority.levels</name>
>
>     <value>2</value>
>
>   </property>
>
>
>
> From: Daryn Sharp <da...@verizonmedia.com>
> Date: Thursday, November 5, 2020 at 9:19 AM
> To: Fengnan Li <lo...@gmail.com>
> Cc: Hdfs-dev <hd...@hadoop.apache.org>
> Subject: Re: [E] Cost Based FairCallQueue latency issue
>
>
>
> I submitted the original 2.8 cost-based FCQ patch (thanks to community
> members for porting to other branches).  We've been running with it since
> early 2019 on all clusters.  Multiple clusters run at a baseline of ~30k+
> ops/sec with some bursting over 100k ops/sec.
>
>
>
> If you are looking at the overall average qtime, yes, that metric is
> expected to increase and means it's working as designed.  De-prioritizing
> write heavy users will naturally result in increased qtime for those
> calls.  Within a bucket, call N's qtime is the sum of the qtime+processing
> for the prior 0..N-1 calls.  This will appear very high for congested low
> priority buckets receiving a fraction of the service rate and skew the
> overall average.
>
>
>
>
>
> On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <lo...@gmail.com> wrote:
>
> Hi all,
>
>
>
> Has anyone deployed the Cost Based Fair Call Queue in their production
> cluster? We ran into some RPC queue latency degradation at ~30k-40k rps.
> I tried to debug but didn’t find anything suspicious. It is worth
> mentioning that there is no memory issue from the extra heap usage for
> storing the call cost.
>
>
>
> Thanks,
>
> Fengnan
>
>

Re: [E] Cost Based FairCallQueue latency issue

Posted by Fengnan Li <lo...@gmail.com>.
Thanks for the response Daryn!

 

I agree with you that the overall average qtime will increase due to the penalty FCQ brings to the heavy users. However, in our environment, out of the same consideration I intentionally turned off the call selection between queues, i.e. the cost is calculated as usual, but all users stay in the first queue. This is to avoid the overall impact.

Here are our configs; the red one is what I added for internal use to turn on this feature (so that only selected users are actually added into the second queue when their cost reaches the threshold).

 

There are two patches for Cost Based FCQ. https://issues.apache.org/jira/browse/HADOOP-16266 and https://issues.apache.org/jira/browse/HDFS-14667. Which version are you using? 

I am now trying to debug them one by one.

 

Thanks,
Fengnan

 

<property>

    <name>ipc.8020.callqueue.capacity.weights</name>

    <value>99,1</value>

  </property>

  <property>

    <name>ipc.8020.callqueue.impl</name>

    <value>org.apache.hadoop.ipc.FairCallQueue</value>

  </property>

  <property>

    <name>ipc.8020.cost-provider.impl</name>

    <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>

    <value>true</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.decay-factor</name>

    <value>0.01</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.period-ms</name>

    <value>20000</value>

  </property>

  <property>

    <name>ipc.8020.decay-scheduler.thresholds</name>

    <value>15</value>

  </property>

  <property>

    <name>ipc.8020.faircallqueue.multiplexer.weights</name>

    <value>99,1</value>

  </property>

  <property>

    <name>ipc.8020.scheduler.priority.levels</name>

    <value>2</value>

  </property>

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 9:19 AM
To: Fengnan Li <lo...@gmail.com>
Cc: Hdfs-dev <hd...@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

I submitted the original 2.8 cost-based FCQ patch (thanks to community members for porting to other branches).  We've been running with it since early 2019 on all clusters.  Multiple clusters run at a baseline of ~30k+ ops/sec with some bursting over 100k ops/sec.  

 

If you are looking at the overall average qtime, yes, that metric is expected to increase and means it's working as designed.  De-prioritizing write heavy users will naturally result in increased qtime for those calls.  Within a bucket, call N's qtime is the sum of the qtime+processing for the prior 0..N-1 calls.  This will appear very high for congested low priority buckets receiving a fraction of the service rate and skew the overall average.

 

 

On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <lo...@gmail.com> wrote:

Hi all,



Has anyone deployed the Cost Based Fair Call Queue in their production cluster? We ran into some RPC queue latency degradation at ~30k-40k rps. I tried to debug but didn’t find anything suspicious. It is worth mentioning that there is no memory issue from the extra heap usage for storing the call cost.



Thanks,

Fengnan


Re: [E] Cost Based FairCallQueue latency issue

Posted by Daryn Sharp <da...@verizonmedia.com.INVALID>.
I submitted the original 2.8 cost-based FCQ patch (thanks to community
members for porting to other branches).  We've been running with it since
early 2019 on all clusters.  Multiple clusters run at a baseline of ~30k+
ops/sec with some bursting over 100k ops/sec.

If you are looking at the overall average qtime, yes, that metric is
expected to increase and means it's working as designed.  De-prioritizing
write heavy users will naturally result in increased qtime for those
calls.  Within a bucket, call N's qtime is the sum of the qtime+processing
for the prior 0..N-1 calls.  This will appear very high for congested low
priority buckets receiving a fraction of the service rate and skew the
overall average.


On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <lo...@gmail.com> wrote:

> Hi all,
>
>
>
> Has anyone deployed the Cost Based Fair Call Queue in their production
> cluster? We ran into some RPC queue latency degradation at ~30k-40k rps.
> I tried to debug but didn’t find anything suspicious. It is worth
> mentioning that there is no memory issue from the extra heap usage for
> storing the call cost.
>
>
>
> Thanks,
>
> Fengnan
>
>