Posted to users@kafka.apache.org by Todd S <to...@borked.ca> on 2014/11/04 13:28:56 UTC

Tuning replication

Good day all,

We're running a good sized Kafka cluster, running 0.8.1, and during our
peak traffic times replication falls behind.  I've been doing some reading
about parameters for tuning replication, but I'd love some real world
experience and insight.

Some general questions:

* Does Kafka 'like' lots of small partitions for replication, or larger
ones?  ie: if I'm passing 1Gbps into a topic, will replication be happier
if that's one partition, or many partitions?

* How can we 'up' the priority of replication over other actions?

* What is the most effective way to monitor the replication lag?  On
brokers with hundreds of partitions, the JMX data starts getting very
muddled and plentiful.  I'm trying to find something we can graph/dashboard
to say 'replication is in X state'.  When we look at it in aggregate, we
assume that 'big numbers are further behind', but then sometimes find
negative numbers as well?

We are looking to make sure our cluster is well balanced, but we've run
into a catch-22: we can't move a partition until all of its replicas are
back in the ISR, yet the box is so overloaded that it never catches up,
so we can't take any load off it. Lather, rinse, repeat.

Ultimately, we need to add even more hardware to the busy clusters, but
that takes some time, so I'm hoping we can get some ideas about what we can
tune and improve.

Thanks,

Todd.

Re: Tuning replication

Posted by Todd Palino <tp...@gmail.com>.
I think your answers are pretty spot-on, Joel. Under Replicated Count is
the metric that we monitor to make sure the cluster is healthy. It lets us
know when a broker is down (the counts on every broker except the one
that's down are elevated) or when a broker is struggling (low counts
fluctuating across a few hosts).
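
For instance, a rough sketch of how those per-broker counts might be read
(the broker names and thresholds below are made up for illustration, and
this assumes you've already pulled UnderReplicatedPartitions from each
broker over JMX):

import java.util.Map;

public class UrpPattern {
    /** Rough read of per-broker UnderReplicatedPartitions counts. */
    static String interpret(Map<String, Integer> urpByBroker) {
        long brokersElevated =
                urpByBroker.values().stream().filter(c -> c > 0).count();
        if (brokersElevated == 0) {
            return "healthy";
        }
        // Elevated counts on (almost) every broker at once: the missing
        // replicas probably all lived on a single broker that is down.
        if (brokersElevated >= urpByBroker.size() - 1) {
            return "looks like a down broker";
        }
        // Low, fluctuating counts on only a few hosts: a broker that is
        // up but struggling to keep its fetchers caught up.
        return "a broker may be struggling";
    }

    public static void main(String[] args) {
        // Hypothetical per-broker counts pulled from JMX.
        System.out.println(
                interpret(Map.of("kafka01", 12, "kafka02", 9, "kafka03", 11)));
    }
}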

As far as lots of small partitions vs. a few large partitions, we prefer
the former. It means we can spread the load out over brokers more evenly.

-Todd

On Tue, Nov 4, 2014 at 10:07 AM, Joel Koshy <jj...@gmail.com> wrote:

> Ops-experts can share more details but here are some comments:
> >
> > * Does Kafka 'like' lots of small partitions for replication, or larger
> > ones?  ie: if I'm passing 1Gbps into a topic, will replication be happier
> > if that's one partition, or many partitions?
>
> Since you also have to account for the NIC utilization by replica
> fetches, it is better to split a heavy topic into many partitions.
>
> > * How can we 'up' the priority of replication over other actions?
>
> If you do the above, this should not be necessary but you could
> increase the number of replica fetchers. (num.replica.fetchers)
>
> > * What is the most effective way to monitor the replication lag?  On
> > brokers with hundreds of partitions, the JMX data starts getting very
> > muddled and plentiful.  I'm trying to find something we can
> graph/dashboard
> > to say 'replication is in X state'.  When we look at it in aggregate, we
> > assume that 'big numbers are further behind', but then sometimes find
> > negative numbers as well?
>
> The easiest mbean to look at is the underreplicated partition count.
> This is at the broker-level so it is coarse-grained. If it is > 0 you
> can use various tools to do mbean queries to figure out which
> partition is lagging behind. Another thing you can look at is the ISR
> shrink/expand rate. If you see a lot of churn you may need to tune the
> settings that affect ISR maintenance (replica.lag.time.max.ms,
> replica.lag.max.messages).
>
>
> --
> Joel
>

Re: Tuning replication

Posted by Todd Palino <tp...@gmail.com>.
We have our alert threshold for under-replicated partitions set at
anything over 2. We picked that number because we have a cluster that
tends to take very high traffic for short periods of time, and 2 gets us
around the false positives (with a careful balance of the partitions in
the cluster). We're also holding ourselves to a fairly strict standard,
so whenever we see URP for any reason, we investigate what's going on and
resolve it so it doesn't happen again.

Technically, we're supposed to be called for any URP alert. In reality, we
don't have any in normal operation unless we have a problem like a down
broker. If replicas are falling behind due to network congestion (or other
resource exhaustion), we balance things out, expand the cluster, or find
our problem producer or consumer and fix them.

-Todd


On Tue, Nov 4, 2014 at 12:13 PM, Todd S <to...@borked.ca> wrote:

> Joel,
>
> Thanks for your input - it fits what I was thinking, so it's good
> confirmation.
>
> > The easiest mbean to look at is the underreplicated partition count.
> > This is at the broker-level so it is coarse-grained. If it is > 0 you
> > can use various tools to do mbean queries to figure out which
> > partition is lagging behind. Another thing you can look at is the ISR
> > shrink/expand rate. If you see a lot of churn you may need to tune the
> > settings that affect ISR maintenance (replica.lag.time.max.ms,
> > replica.lag.max.messages).
>
> and Todd Palino said:
>
> > Under Replicated Count is the metric that we monitor to make sure the
> > cluster is healthy.
>
> We report/alert on under-replicated partitions.  What I'm trying to do
> is get away from event-driven alerts to the NOC/ops people, and give
> them something qualitative (replication is {ok|a little
> behind|behind|really behind|really really behind|oh no we're doomed})
> so we know how to respond appropriately.  I don't really want ops
> folks getting called at 2am on a Saturday because a single replica is
> behind by a few thousand messages... however I *do* want someone
> called if we're a billion messages behind.
>
> If I look at
> 'KAFKA|kafka.server|FetcherLagMetrics|ReplicaFetcherThread-.*:Value',
> can I use that as my measure of badness/behindness?
>
>
> In a similar vein, at what point do you/Todd/others wake someone up?
> How many replicas out of sync, by how much?  What is the major concern
> point, vs 'meh, it'll catch up soon'?  I know it's likely different
> between different environments, but as I'm new to this, I'd love to
> know how others see things.
>
> Thanks!
>

Re: Tuning replication

Posted by Todd S <to...@borked.ca>.
Joel,

Thanks for your input - it fits what I was thinking, so it's good confirmation.

> The easiest mbean to look at is the underreplicated partition count.
> This is at the broker-level so it is coarse-grained. If it is > 0 you
> can use various tools to do mbean queries to figure out which
> partition is lagging behind. Another thing you can look at is the ISR
> shrink/expand rate. If you see a lot of churn you may need to tune the
> settings that affect ISR maintenance (replica.lag.time.max.ms,
> replica.lag.max.messages).

and Todd Palino said:

> Under Replicated Count is the metric that we monitor to make sure the
> cluster is healthy.

We report/alert on under-replicated partitions.  What I'm trying to do
is get away from event-driven alerts to the NOC/ops people, and give
them something qualitative (replication is {ok|a little
behind|behind|really behind|really really behind|oh no we're doomed})
so we know how to respond appropriately.  I don't really want ops
folks getting called at 2am on a Saturday because a single replica is
behind by a few thousand messages... however I *do* want someone
called if we're a billion messages behind.

If I look at 'KAFKA|kafka.server|FetcherLagMetrics|ReplicaFetcherThread-.*:Value',
can I use that as my measure of badness/behindness?
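
For illustration, this is roughly the kind of aggregation I have in mind,
as a minimal JMX sketch (the host/port are placeholders and the ObjectName
pattern is my guess; the exact mbean naming seems to differ between Kafka
versions):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MaxReplicaLag {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker JMX endpoint.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Match every replica-fetcher lag gauge on this broker and keep
            // the worst value as a single "behindness" number to graph.
            ObjectName pattern =
                    new ObjectName("kafka.server:type=FetcherLagMetrics,*");
            long worst = 0;
            for (ObjectName name : mbs.queryNames(pattern, null)) {
                Number lag = (Number) mbs.getAttribute(name, "Value");
                worst = Math.max(worst, lag.longValue());
            }
            System.out.println("worst replica lag on this broker: " + worst);
        }
    }
}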


In a similar vein, at what point do you/Todd/others wake someone up?
How many replicas out of sync, by how much?  What is the major concern
point, vs 'meh, it'll catch up soon'?  I know it's likely different
between different environments, but as I'm new to this, I'd love to
know how others see things.

Thanks!

Re: Tuning replication

Posted by Joel Koshy <jj...@gmail.com>.
Ops-experts can share more details but here are some comments:
> 
> * Does Kafka 'like' lots of small partitions for replication, or larger
> ones?  ie: if I'm passing 1Gbps into a topic, will replication be happier
> if that's one partition, or many partitions?

Since you also have to account for the NIC utilization by replica
fetches, it is better to split a heavy topic into many partitions.
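
As a rough worked example (assuming a replication factor of 3 and leaders
spread evenly, which may not match your setup): 1 Gbps of produce traffic
into a topic means each byte written to a leader is fetched by two
followers, so roughly another 2 Gbps of replica-fetch traffic has to leave
the brokers hosting the leaders. With a single partition all of that sits
on one broker's NIC; with many partitions it is spread across the cluster.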

> * How can we 'up' the priority of replication over other actions?

If you do the above, this should not be necessary but you could
increase the number of replica fetchers. (num.replica.fetchers)
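
For example, a hedged server.properties tweak (the value 4 is purely
illustrative; the default is 1, and more fetcher threads mean more
parallel replica fetches at the cost of extra connections):

num.replica.fetchers=4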

> * What is the most effective way to monitor the replication lag?  On
> brokers with hundreds of partitions, the JMX data starts getting very
> muddled and plentiful.  I'm trying to find something we can graph/dashboard
> to say 'replication is in X state'.  When we look at it in aggregate, we
> assume that 'big numbers are further behind', but then sometimes find
> negative numbers as well?

The easiest mbean to look at is the underreplicated partition count.
This is at the broker-level so it is coarse-grained. If it is > 0 you
can use various tools to do mbean queries to figure out which
partition is lagging behind. Another thing you can look at is the ISR
shrink/expand rate. If you see a lot of churn you may need to tune the
settings that affect ISR maintenance (replica.lag.time.max.ms,
replica.lag.max.messages).
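
For example, a minimal JMX sketch for polling these (the host/port are
placeholders, and the exact ObjectName spelling differs a bit between
0.8.1 and later releases, so check what your brokers actually expose):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker JMX endpoint.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Broker-wide count of partitions that are missing in-sync
            // replicas; anything above 0 is worth a look.
            ObjectName urp = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            System.out.println("UnderReplicatedPartitions = "
                    + mbs.getAttribute(urp, "Value"));
            // ISR churn: a high shrink/expand rate usually means the
            // replica lag settings (or cluster balance) need attention.
            ObjectName shrinks = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec");
            System.out.println("IsrShrinksPerSec (1m rate) = "
                    + mbs.getAttribute(shrinks, "OneMinuteRate"));
        }
    }
}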


-- 
Joel