Posted to users@kafka.apache.org by Richard Rodseth <rr...@gmail.com> on 2018/02/02 19:27:56 UTC

Usual remedy for "Under Replicated" and "Offline Partitions"

We have a Datadog integration showing some metrics, and for one of our
clusters the above two values are > 0 and highlighted in red.

What's the usual remedy (Confluent Platform, OSS version)?

Thanks

Re: Usual remedy for "Under Replicated" and "Offline Partitions"

Posted by Richard Rodseth <rr...@gmail.com>.
Thanks Jeff!


Re: Usual remedy for "Under Replicated" and "Offline Partitions"

Posted by Jeff Widman <je...@jeffwidman.com>.
This means either the brokers are not healthy (bad hardware) or the
replication fetchers can't keep up with the rate of incoming messages.

If the latter, you need to figure out where the latency bottleneck is and
what your latency SLAs are.
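
Either way, a useful first step is to see exactly which partitions and
brokers are affected. A minimal sketch with the stock CLI (assuming
ZooKeeper is reachable at localhost:2181; adjust for your setup):

    # partitions whose ISR is smaller than the full replica set
    kafka-topics --zookeeper localhost:2181 --describe --under-replicated-partitions

    # partitions that currently have no leader at all
    kafka-topics --zookeeper localhost:2181 --describe --unavailable-partitions

If the same broker is the lagging replica everywhere, suspect that broker's
hardware; if the lag is spread across the cluster, suspect tuning.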

Common sources of latency bottlenecks (the matching broker settings are
sketched below):
 - slow network round trips: increase network speed, increase the bytes
fetched per round trip, increase the number of simultaneous fetchers, or
increase the timeout so the broker has time to fill all the bytes in the
fetch request...
 - slow broker disk I/O: use faster disks, or increase the Linux page cache
size
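
For reference, a sketch of the server.properties knobs behind those two
bullets, with the Kafka 1.0 defaults shown (starting points, not
recommendations):

    # fetcher threads each broker runs per source broker
    num.replica.fetchers=1
    # max bytes fetched per partition per request
    replica.fetch.max.bytes=1048576
    # how long the leader may wait to fill a fetch before responding
    replica.fetch.wait.max.ms=500
    # TCP receive buffer used by the replica fetchers
    replica.socket.receive.buffer.bytes=65536
    # how long a follower may lag before it is dropped from the ISR
    replica.lag.time.max.ms=10000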

There are JMX metrics that help disambiguate whether the problem is disk vs
network... unfortunately the Datadog check is missing many of these,
something I've had on my todo list to patch, as we also use Datadog at my
day job.
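
As a sketch of what to look at (assuming JMX is exposed on the brokers,
e.g. on port 9999), the follower fetch timings split total latency into a
local component (log reads, mostly disk/page cache) and remote/send
components (waiting and network):

    kafka-run-class kafka.tools.JmxTool \
      --jmx-url service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi \
      --object-name 'kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchFollower'

    # related MBeans: TotalTimeMs, RemoteTimeMs, RequestQueueTimeMs and
    # ResponseSendTimeMs under the same type, plus
    # kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica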

One other possible problem is when a lot of low-volume partitions are being
replicated in each call alongside a couple of high-volume partitions. The
broker can take a long time assembling the response because it has to look
at each partition, and each one might add only 1 KB, so it takes a long time
to reach the fetch size limit (1 MB by default) and hits the timeout first.
It then sends back a small response, even though you've got a handful of
partitions that are really hot and will soon be marked as not being in sync.
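
A rough illustration with made-up numbers, assuming the defaults above and
the behavior described:

    300 quiet partitions x ~1 KB each ≈ 300 KB, well under the 1 MB cap,
    so the fetch waits out replica.fetch.wait.max.ms (500 ms) instead;
    at ~2 fetch round trips per second, a hot partition capped at
    replica.fetch.max.bytes (1 MB) drains at only ~2 MB/s per fetcher.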

I know this doesn't provide full details, but hopefully it's enough to get
you pointed in the right direction...

Cheers,
Jeff






-- 

*Jeff Widman*
jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)
<><