Posted to dev@kafka.apache.org by Viktor Somogyi-Vass <vi...@gmail.com> on 2019/02/22 11:10:16 UTC

[DISCUSS] KIP-434: Add Replica Fetcher and Log Cleaner Count Metrics

Hi All,

I'd like to start a discussion about exposing count gauge metrics for the
replica fetcher and log cleaner thread counts. It isn't a long KIP and the
motivation is very simple: monitoring the thread counts in these cases
would help with the investigation of various issues and might help in
preventing more serious issues when a broker is in a bad state. One scenario
we've seen with users is that their disk fills up because the log cleaner died
for some reason and couldn't recover (e.g. due to log corruption). In such
cases an early warning would help the root cause analysis process as well as
enable detecting and resolving the problem early on.
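
To make this concrete, below is a minimal sketch of the kind of gauge I have
in mind, using the Yammer metrics library the broker already ships with (the
class and metric names here are illustrative only, not the final ones from
the KIP):

    import com.yammer.metrics.Metrics
    import com.yammer.metrics.core.Gauge

    // Illustrative sketch only: a gauge reporting how many log cleaner
    // threads are currently alive. The KIP proposes similar gauges for
    // the replica fetcher threads as well.
    class LogCleanerThreadCountGauge(cleanerThreads: Seq[Thread]) {
      Metrics.defaultRegistry().newGauge(
        getClass, "log-cleaner-thread-count",
        new Gauge[Int] {
          override def value: Int = cleanerThreads.count(_.isAlive)
        })
    }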

The KIP is here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics

I'd be happy to receive any feedback on this.

Regards,
Viktor

Re: [DISCUSS] KIP-434: Add Replica Fetcher and Log Cleaner Count Metrics

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.
Hey Folks,

Since there has been no discussion on this in the past two weeks, I'll create
a VOTE thread soon.

Thanks,
Viktor

On Thu, Mar 14, 2019 at 7:05 PM Viktor Somogyi-Vass <vi...@gmail.com>
wrote:


Re: [DISCUSS] KIP-434: Add Replica Fetcher and Log Cleaner Count Metrics

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.
Hey Stanislav,

Sorry for the delay on this. In the meantime I realized that dead fetchers
won't be removed from the fetcher map, so it's very easy to figure out how
many dead and alive ones there are. I can collect them at the broker level,
which I think gives good enough information to tell whether there is a
problem with a given broker. You do raise a good point with your idea that
it's helpful to know which fetcher is acting up, although I find that
solution less feasible due to the number of metrics generated. For that I'd
rather use a JMX method that'd return a list of the problematic fetcher
threads, but I'm not sure we have to extend the scope of this KIP that much.
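
As a rough sketch of what I mean (the map layout and names below are
assumptions for illustration only, not the actual fetcher manager internals):

    import scala.collection.mutable

    // Hypothetical key: which broker we fetch from and which fetcher slot.
    case class BrokerAndFetcherId(brokerId: Int, fetcherId: Int)

    class FetcherManagerSketch {
      // Dead fetcher threads are not removed from this map, so both the
      // alive and the dead counts can be derived from it per broker.
      private val fetcherThreadMap = mutable.Map.empty[BrokerAndFetcherId, Thread]

      def aliveFetcherThreadCount: Int = fetcherThreadMap.values.count(_.isAlive)
      def deadFetcherThreadCount: Int = fetcherThreadMap.size - aliveFetcherThreadCount

      // The JMX-operation idea: list the names of the problematic threads.
      def deadFetcherThreadNames: Seq[String] =
        fetcherThreadMap.values.filterNot(_.isAlive).map(_.getName).toSeq
    }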

Best,
Viktor

On Mon, Mar 4, 2019 at 7:22 PM Stanislav Kozlovski <st...@confluent.io>
wrote:


Re: [DISCUSS] KIP-434: Add Replica Fetcher and Log Cleaner Count Metrics

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey Viktor,

> however, displaying the thread count (be it alive or dead) would still add
> extra information about the failure, namely that a thread died during cleanup.

I agree, I think it's worth adding.


> Doing this for the replica fetchers, though, would be a bit harder, as the
> number of replica fetchers is (brokers-to-fetch-from * fetchers-per-broker)
> and we don't really maintain the capacity information or any kind of
> cluster information, and I'm not sure we should.

Perhaps we could split the metric per broker that is being fetched from?
I.e. each replica fetcher would have a `dead-fetcher-threads` metric that
has the broker-id it's fetching from as a tag?
It would answer an observability question which I think is very important:
are we replicating from this broker at all?
On the other hand, this could potentially produce a lot of metric data in
a big cluster, so that is definitely something to consider as well.
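
As an illustration, something like the following could register such a
per-broker gauge (a sketch only: vanilla Yammer metrics has no tag support,
so the broker id is folded into the MBean scope here, and all names are made
up rather than taken from the KIP):

    import com.yammer.metrics.Metrics
    import com.yammer.metrics.core.{Gauge, MetricName}

    // Sketch: one dead-fetcher-threads gauge per source broker, with the
    // broker id encoded in the metric scope. Names are illustrative.
    def registerDeadFetcherGauge(sourceBrokerId: Int,
                                 fetchers: () => Seq[Thread]): Gauge[Int] = {
      val name = new MetricName("kafka.server", "ReplicaFetcherManager",
        "dead-fetcher-threads", s"brokerId.$sourceBrokerId")
      Metrics.defaultRegistry().newGauge(name, new Gauge[Int] {
        // Counts this broker's fetcher threads that are no longer alive.
        override def value: Int = fetchers().count(t => !t.isAlive)
      })
    }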

All in all, I think this is a great KIP and very much needed in my opinion.
I can't wait to see this roll out.

Best,
Stanislav

On Mon, Feb 25, 2019 at 10:29 AM Viktor Somogyi-Vass <
viktorsomogyi@gmail.com> wrote:


Re: [DISCUSS] KIP-434: Add Replica Fetcher and Log Cleaner Count Metrics

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.
Hi Stanislav,

Thanks for the feedback and sharing that discussion thread.

I read your KIP and the discussion on it too, and it seems like that'd cover
the same motivation I had with the log-cleaner-thread-count metric. That
metric is supposed to tell the count of the alive threads, which might
differ from the config (I could've used a better name :) ). Now I'm thinking
that using uncleanable-bytes and uncleanable-partition-count together with
time-since-last-run would mostly cover the motivation I have in this KIP;
however, displaying the thread count (be it alive or dead) would still add
extra information about the failure, namely that a thread died during cleanup.

You had a very good idea: instead of the alive threads, display the dead
ones! That way we wouldn't need log-cleaner-current-live-thread-rate, just a
"dead-log-cleaner-thread-count", as it would make it easy to trigger warnings
(if it's ever > 0 then we can say there's a potential problem).
Doing this for the replica fetchers, though, would be a bit harder, as the
number of replica fetchers is (brokers-to-fetch-from * fetchers-per-broker)
and we don't really maintain the capacity information or any kind of cluster
information, and I'm not sure we should. It would add too much responsibility
to the class and wouldn't be a rock-solid solution, but I guess I have to
look into it more.
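
Just to illustrate the sizing problem with made-up numbers:

    // Illustrative numbers only: the expected fetcher count depends on
    // cluster state that the fetcher manager doesn't track itself.
    val brokersToFetchFrom = 5  // brokers this one currently replicates from
    val fetchersPerBroker  = 2  // the num.replica.fetchers config
    val expectedFetchers   = brokersToFetchFrom * fetchersPerBroker  // = 10
    // A dead-thread count sidesteps this: alert whenever it is > 0.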

I don't think that restarting the cleaner threads would be helpful, as the
problems I've seen are mostly non-recoverable and require manual user
intervention, and I partly agree with what Colin said on the KIP-346
discussion thread about the problems experienced with HDFS.

Best,
Viktor


On Fri, Feb 22, 2019 at 5:03 PM Stanislav Kozlovski <st...@confluent.io>
wrote:


Re: [DISCUSS] KIP-434: Add Replica Fetcher and Log Cleaner Count Metrics

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey Viktor,

First off, thanks for the KIP! I think that it is almost always a good idea
to have more metrics. Observability never hurts.

In regards to the LogCleaner:
* Do we need to know log-cleaner-thread-count? That should always be equal
to "log.cleaner.threads" if I'm not mistaken.
* log-cleaner-current-live-thread-rate - We already have the
"time-since-last-run-ms" metric, which can let you know if something is
wrong with the log cleaning.
As you said, we would like to have these two new metrics in order to
understand when a partial failure has happened - e.g. only 1/3 of the log
cleaner threads are alive. I'm wondering if it may make more sense to either:
a) restart the threads when they die (see the sketch below)
b) add a metric which shows the dead thread count. You should probably
always have a low-level alert in the case that any threads have died.
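
For (a), I'm thinking of something along these lines (purely a sketch with a
generic worker thread, not how the LogCleaner actually behaves today):

    import java.util.concurrent.{Executors, TimeUnit}

    // Sketch of option (a): periodically check a worker thread and start
    // a replacement if it died. Illustrative only.
    def superviseAndRestart(name: String, work: Runnable): Unit = {
      val scheduler = Executors.newSingleThreadScheduledExecutor()
      var worker = new Thread(work, name)
      worker.start()
      scheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit =
          if (!worker.isAlive) { // died, e.g. from an uncaught exception
            worker = new Thread(work, name)
            worker.start()
          }
      }, 10, 10, TimeUnit.SECONDS)
    }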

We had discussed a similar topic about thread revival and metrics in
KIP-346. Have you had a chance to look over that discussion? Here is the
mailing list discussion for that:
http://mail-archives.apache.org/mod_mbox/kafka-dev/201807.mbox/%3CCANZZNGyR_22GO9SwL67hedCM90XhVPyFzy_TezHZ1MriZqK_tg@mail.gmail.com%3E

Best,
Stanislav



On Fri, Feb 22, 2019 at 11:18 AM Viktor Somogyi-Vass <
viktorsomogyi@gmail.com> wrote:
