Posted to users@kafka.apache.org by Gerard Klijs <ge...@dizzit.com> on 2016/09/16 12:11:58 UTC

Slow machine disrupting the cluster

We just had an interesting issue; luckily it was only on our test cluster.
For some reason one of the machines in the cluster became really slow.
Because it was still alive, it was still the leader for some
topic-partitions. Our mirror maker reads and writes to multiple
topic-partitions on each thread. Committing the offsets then fails
for the topic-partitions located on the slow machine, because the consumers
have timed out. The data for these topic-partitions is sent over and
over, causing a flood of duplicate messages.
What would be the best way to prevent this in the future? Is there some way
the broker could notice it is performing poorly and shut itself off, for example?
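
For context, a minimal sketch of the consume-forward-commit loop a
MirrorMaker-style process runs, written against the plain Java clients; the
broker addresses, group id and topic name are placeholders, not taken from the
thread. When the commit cannot reach the slow broker, the read position never
advances, which is where the flood of duplicates comes from:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.CommitFailedException;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MirrorLoopSketch {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put("bootstrap.servers", "source-broker:9092");  // placeholder
            c.put("group.id", "mirror-sketch");                 // placeholder
            c.put("enable.auto.commit", "false");
            c.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            c.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            Properties p = new Properties();
            p.put("bootstrap.servers", "target-broker:9092");  // placeholder
            p.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
            p.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(c);
                 KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(p)) {
                consumer.subscribe(Collections.singletonList("SOURCE_TOPIC"));
                while (true) {
                    ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
                    for (ConsumerRecord<byte[], byte[]> r : records) {
                        producer.send(new ProducerRecord<>(r.topic(), r.key(), r.value()));
                    }
                    producer.flush();
                    try {
                        // If the broker handling the commit is the slow one, the
                        // request times out and the committed offsets never move.
                        consumer.commitSync();
                    } catch (CommitFailedException e) {
                        // Nothing was committed, so after the rebalance the same
                        // records are fetched and forwarded again -> duplicates.
                    }
                }
            }
        }
    }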

Re: Slow machine disrupting the cluster

Posted by Gerard Klijs <ge...@dizzit.com>.
It turned out to be an over-provisioned VM. It was eventually solved by
moving the VM to another cluster. It was also not just a little slow, but
something on the order of 100 times slower. We are now looking for some
metrics to watch and alert on in case a broker gets slow.
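
For reference, a minimal sketch of pulling two broker-side JMX metrics that
tend to expose a slow broker (produce-request time percentiles and request
handler idle percentage); the host name and JMX port are placeholders, and the
metric names follow the 0.10-era broker JMX naming:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerSlownessCheck {
        public static void main(String[] args) throws Exception {
            // Assumes the broker was started with JMX enabled on port 9999.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://slow-broker:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // 99th percentile of the total time (ms) the broker spends on
                // produce requests.
                Object produceP99 = mbs.getAttribute(new ObjectName(
                    "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"),
                    "99thPercentile");

                // How idle the request handler threads are; values close to 0
                // mean the broker is saturated.
                Object handlerIdle = mbs.getAttribute(new ObjectName(
                    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                    "OneMinuteRate");

                System.out.println("Produce TotalTimeMs p99:        " + produceP99);
                System.out.println("Request handler idle (1m rate): " + handlerIdle);
            } finally {
                connector.close();
            }
        }
    }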

On Fri, Sep 16, 2016 at 4:41 PM David Garcia <da...@spiceworks.com> wrote:

> To remediate, you could start another broker, rebalance, and then shut
> down the busted broker.  But, you really should put some monitoring on your
> system (to help diagnose the actual problem).  Datadog has a pretty good
> set of articles for using jmx to do this:
> https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
>
> There are lots of jmx metrics gathering tools too…such as jmxtrans:
> https://github.com/jmxtrans/jmxtrans
>
> <confluent-plug>
> Confluent also offers tooling (such as Control Center) to help with
> monitoring.
> </confluent-plug>
>
> As far as mirror maker goes, you can play with the consumer/producer
> timeout settings to make sure the process waits long enough for a slow
> machine.
>
> -David

Re: Slow machine disrupting the cluster

Posted by David Garcia <da...@spiceworks.com>.
To remediate, you could start another broker, rebalance, and then shut down the busted broker.  But, you really should put some monitoring on your system (to help diagnose the actual problem).  Datadog has a pretty good set of articles for using jmx to do this: https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

There are lots of jmx metrics gathering tools too…such as jmxtrans: https://github.com/jmxtrans/jmxtrans

<confluent-plug>
Confluent also offers tooling (such as Control Center) to help with monitoring.
</confluent-plug>

As far as mirror maker goes, you can play with the consumer/producer timeout settings to make sure the process waits long enough for a slow machine.

-David
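
The consumer/producer timeout settings mentioned above would look roughly like
the following; a minimal sketch with illustrative values (not recommendations)
and placeholder hosts, written as a small Java program that emits the two
property files MirrorMaker takes via --consumer.config and --producer.config:

    import java.io.FileOutputStream;
    import java.util.Properties;

    public class MirrorMakerTimeouts {
        public static void main(String[] args) throws Exception {
            // Consumer side: give the (possibly slow) source brokers more time
            // before requests time out and the group session is declared dead.
            Properties consumer = new Properties();
            consumer.put("bootstrap.servers", "source-broker:9092");  // placeholder
            consumer.put("group.id", "mirror-maker");                 // placeholder
            consumer.put("session.timeout.ms", "60000");
            consumer.put("request.timeout.ms", "90000");  // keep above session.timeout.ms
            consumer.put("max.poll.records", "500");      // smaller batches, quicker commits

            // Producer side: allow slow target brokers to answer, and retry
            // instead of dropping batches.
            Properties producer = new Properties();
            producer.put("bootstrap.servers", "target-broker:9092");  // placeholder
            producer.put("acks", "all");
            producer.put("request.timeout.ms", "90000");
            producer.put("retries", "10");
            producer.put("max.block.ms", "120000");

            try (FileOutputStream c = new FileOutputStream("consumer.properties");
                 FileOutputStream p = new FileOutputStream("producer.properties")) {
                consumer.store(c, "for kafka-mirror-maker.sh --consumer.config");
                producer.store(p, "for kafka-mirror-maker.sh --producer.config");
            }
        }
    }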
