Posted to dev@druid.apache.org by Dylan Wylie <dy...@apache.org> on 2020/05/06 17:45:49 UTC

Segment Load Contention

Hey folks,

Discovered recently that when taking historicals down for maintenance that
we get pretty significant query latency spikes across that node's tier.

These spikes seem to be related to contention from ZKCoordinator threads
unzipping segments from deep storage to replace those from the stopped
historical.

The default value for druid.segmentCache.numLoadingThreads is the number of
cores on the host. I haven't done any detailed profiling to be sure, but
intuitively it seems like a lower default might be safer to avoid contending
with query workloads; at least, setting it to a much lower value looks to
have fixed our problem.
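
For anyone who wants to try the same workaround, it just means overriding
the config in the historical's runtime.properties. The value of 4 below is
purely illustrative, not a tested recommendation:

    druid.segmentCache.numLoadingThreads=4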

Maybe I've only noticed it because there's something unique about our setup,
so I'm curious whether anyone else has experienced something similar. It
would also be interesting if anyone can confirm that they don't see an
impact on latency when nodes are taken down with the config left at its
default.

(This is all on 0.16.1 btw, I haven't tried to replicate it on a newer
version yet)

Best regards,
Dylan

Re: Segment Load Contention

Posted by Samarth Jain <sa...@gmail.com>.
Hi Dylan,

I think it does make sense to lower the default value of the config. Maybe
max(1, number_of_processors / 6)?
We had to lower the number in our environment as well.
I actually had it configured to 2 * number of processors and it resulted in
such high contention that things just halted. See this slack thread:
https://the-asf.slack.com/archives/CJ8D1JTB8/p1579135549106100

I eventually had to dial it down to one sixth of the number of processors
to stay safe while still keeping enough download parallelism.
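
In code, the suggested default would look something like the following
(just a sketch; the divisor of 6 is simply what worked for us, not a tuned
constant):

    // Hypothetical default for druid.segmentCache.numLoadingThreads:
    // scale with core count, but keep a floor of one loading thread.
    int numLoadingThreads = Math.max(1, Runtime.getRuntime().availableProcessors() / 6);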

