Posted to user@kudu.apache.org by Todd Lipcon <to...@cloudera.com> on 2020/04/20 16:37:36 UTC

Re: Implications/downside of increasing rpc_service_queue_length

Hi Mauricio,

Sorry for the late reply on this one. Hope "better late than never" is the
case here :)

As you implied in your email, the main issue with increasing queue length
to deal with queue overflows is that it only helps with momentary spikes.
According to queueing theory (and intuition), if the rate of arrival of
entries into a queue is faster than the rate of processing items in that
queue, then the queue length will grow. If this is a transient phenomenon
(e.g. a quick burst of requests) then having a larger queue capacity will
prevent overflows, but if this is a persistent phenomenon, then there is no
length of queue that is sufficient to prevent overflows. The one exception
is when the number of potential concurrent queue entries is itself bounded
(e.g. because there is a bounded number of clients).
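
(As a purely illustrative example with made-up numbers: if writes arrive at
1,000 requests/sec but the tserver can only process 800/sec, the queue grows
by roughly 200 entries every second, so a capacity of 50, 100, or even
10,000 slots only changes how many seconds pass before it overflows.)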

According to the above theory, the philosophy behind the default short
queue is that longer queues aren't a real solution if the cluster is
overloaded. That said, if you think that the issues are just transient
spikes rather than a capacity overload, it's possible that bumping the
queue length (e.g. to 100) can help here.
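
If it helps, that change is just a tablet server gflag; a minimal sketch,
assuming you pass flags through a gflagfile (adjust to however you manage
your tserver configuration), would be:

    # kudu-tserver gflagfile
    --rpc_service_queue_length=100

followed by a tablet server restart (I don't believe this flag is settable
at runtime, but double-check for your version).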

In terms of things to be aware of: having a longer queue means that the
amount of memory taken by entries in the queue is increased proportionally.
Currently, that memory is not tracked as part of Kudu's MemTracker
infrastructure, but it does get accounted for in the global heap and can
push the server into "memory pressure" mode, where requests will start
getting rejected, rowsets will get flushed, etc. I would recommend that, if
you increase your queues, you make sure you have a relatively larger
memory limit allocated to your tablet servers, and watch out for log
messages and metrics indicating persistent memory pressure (particularly in
the 80%+ range, where things start getting dropped a lot).
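
Concretely, the knobs I'd look at here (flag names from memory, so please
double-check them against the docs for your version) are along the lines of:

    # kudu-tserver gflagfile
    --memory_limit_hard_bytes=<comfortably larger than your current limit>
    --memory_limit_soft_percentage=80  # default; this is the 80% threshold
                                       # mentioned above

together with the tserver logs, which print "memory pressure" / "soft memory
limit exceeded" style messages when the server is in that state.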

Long queues are also potentially an issue for low-latency requests: the
longer the queue (in terms of items), the longer the latency of elements
waiting in that queue. If you have any latency SLAs, you should monitor
them closely as you change the queue length configuration.
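
For watching this, the metrics on the tablet server's web UI /metrics
endpoint (port 8050 by default) are probably the easiest handle; from
memory, rpcs_queue_overflow counts requests rejected because the queue was
full, and the rpc_incoming_queue_time histogram tells you directly how long
requests sit in the queue before a service thread picks them up.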

Hope that helps

-Todd

Re: Implications/downside of increasing rpc_service_queue_length

Posted by Alexey Serbin <as...@cloudera.com>.
I guess the point about the low-latency requests was that long RPC queues
might add extra latency to request handling, and the latency might be
unpredictably long.  E.g., if the queue is almost full and a new RPC
request is added, the request will be dispatched to one of the available
service threads only after the already enqueued requests have been
dispatched.  And the number of service threads in the service thread pool
is limited.
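
As a rough back-of-envelope illustration (the numbers are made up, not
defaults): with 20 service threads and an average of 50 ms of handler time
per request, a request that lands behind 100 already-queued entries waits
about 100 * 50 ms / 20 = 250 ms before a thread even picks it up, on top of
its own processing time.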


Thanks,

Alexey

On Thu, Apr 30, 2020 at 11:17 AM Mauricio Aristizabal <ma...@impact.com>
wrote:

> Thanks Todd. Better late than never indeed, appreciate it very much.
>
> Yes, precisely, we are dealing with very spikey ingest.
>
> Immediate issue has been addressed though: we extended the spark
> KuduContext so we could build our own AsyncKuduClient and
> increase defaultOperationTimeoutMs from default 30s to 120s and that has
> eliminated the client timeouts.
>
> One followup question: not sure I understand your comment re/ low-latency
> requests - if data was ingested it is already in MemStore and therefore
> available to clients, so whether queued or not, it should not make a
> difference on data availability right? except maybe slow down scans/queries
> a bit since they have to read more data from MemStore and uncompacted
> RowStores?
>
> thanks again,
>
> -m
>
> --
> Mauricio Aristizabal
> Architect - Data Pipeline
> mauricio@impact.com | 323 309 4260
> https://impact.com
>

Re: Implications/downside of increasing rpc_service_queue_length

Posted by Mauricio Aristizabal <ma...@impact.com>.
Thanks Todd. Better late than never indeed, appreciate it very much.

Yes, precisely, we are dealing with very spikey ingest.

Immediate issue has been addressed though: we extended the Spark
KuduContext so we could build our own AsyncKuduClient and
increase defaultOperationTimeoutMs from the default 30s to 120s, and that
has eliminated the client timeouts.
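
In case the concrete bit is useful to anyone else, the client construction
boils down to roughly the following (a sketch only: it leaves out our
KuduContext subclass, the class and method names here are just for
illustration, and the builder calls are the stock Kudu Java client API as I
recall it):

    import org.apache.kudu.client.AsyncKuduClient;

    public class LongTimeoutKuduClient {
      // Build an AsyncKuduClient whose per-operation timeout is 120s
      // instead of the 30s default.
      public static AsyncKuduClient create(String masterAddresses) {
        return new AsyncKuduClient.AsyncKuduClientBuilder(masterAddresses)
            .defaultOperationTimeoutMs(120_000)
            .build();
      }
    }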

One followup question: not sure I understand your comment re: low-latency
requests - if data was ingested, it is already in the MemStore and therefore
available to clients, so whether queued or not, it should not make a
difference to data availability, right? Except maybe slow down scans/queries
a bit since they have to read more data from the MemStore and uncompacted
RowStores?

thanks again,

-m


-- 
Mauricio Aristizabal
Architect - Data Pipeline
mauricio@impact.com | 323 309 4260
https://impact.com