Posted to dev@kafka.apache.org by Stanislav Kozlovski <st...@confluent.io> on 2018/07/24 00:11:19 UTC

[DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Hey group,

I created a new KIP about making log compaction more fault-tolerant. Please
give it a look here and please share what you think, especially in regards
to the points in the "Needs Discussion" paragraph.

KIP: KIP-346
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey James, Ted,

@James - Thanks for showing me some of the changes, that was informative.

* *Log Cleaner Thread Revival* - I also acknowledge that could be useful.
My concern is that if the thread has died, there is most likely something
wrong with either the disk or the software, and since both are deterministic
(correct me if I'm wrong), we will most likely hit the same failure again
very soon. I am not sure reviving the thread would help in that scenario,
but I am also not sure it would hurt. Could the cycle of dying and
restarting waste a significant amount of CPU?

* *Partition Re-clean* - Hmm, maybe some sort of retry mechanism could be
worth exploring. I'd like to hear other people's opinion on this and
whether or not they've seen such scenarios before diving into possible
implementation.

* *Metric* - Could you point me to some resources showing how the JMX
metrics should be structured? I could not find any, and I am sadly not too
knowledgeable on the topic.

* *uncleanable-partitions* *metric* - Yes, that might be problematic. Maybe
the format Ted suggested would be best - "topic1-0,1,2". Then again, I fear
we might still run out of characters. I am not sure how to best approach
this yet.

* *Disk Problems* - I am aware that the 4 JIRAs are not related to disk
problems. I think this KIP brings the most value to exactly such scenarios
- ones where the disk is OK. Even so, I suggested failing the disk after a
certain number of errors on it because that seemed sensible to me. I do not
have a strong opinion about this, though. Now that you have pointed out
that this actually increases the blast radius, I tend to agree. Maybe we
should scrap this behavior.
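If thread revival were ever adopted, a capped exponential backoff would bound the CPU cost of a crash loop. A minimal sketch (the class and method names here are hypothetical, not from the KIP):

```java
import java.util.concurrent.TimeUnit;

public class CleanerRespawner {
    private static final long MAX_BACKOFF_MS = TimeUnit.MINUTES.toMillis(8);

    // Doubles the wait before each respawn: 1, 2, 4, 8 minutes, then stays
    // capped at 8 minutes so a deterministic failure cannot spin the CPU.
    static long backoffMs(int attempt) {
        long base = TimeUnit.MINUTES.toMillis(1);
        long backoff = base << Math.min(attempt, 3);
        return Math.min(backoff, MAX_BACKOFF_MS);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("respawn attempt " + attempt
                    + " waits " + backoffMs(attempt) + " ms");
        }
    }
}
```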

Best,
Stanislav

On Tue, Jul 24, 2018 at 6:13 AM Ted Yu <yu...@gmail.com> wrote:

> As James pointed out in his reply, topic-partition name can be long.
> It is not necessary to repeat the topic name for each of its partitions.
> How about the following format:
>
> topic-name1-{partition1, partition2, etc}
>
> That is, topic name only appears once.
>
> Cheers


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ted Yu <yu...@gmail.com>.
As James pointed out in his reply, topic-partition name can be long.
It is not necessary to repeat the topic name for each of its partitions.
How about the following format:

topic-name1-{partition1, partition2, etc}

That is, topic name only appears once.

Cheers
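Ted's grouped format could be produced along these lines (a sketch only; the class name and the assumption that the partition id follows the last dash are mine, not the KIP's):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class UncleanablePartitionsFormatter {
    // Formats ["topic1-0", "topic1-2", "topic2-1"] as "topic1-{0,2},topic2-{1}",
    // so each topic name appears only once.
    static String format(List<String> topicPartitions) {
        Map<String, List<String>> byTopic = new TreeMap<>();
        for (String tp : topicPartitions) {
            int dash = tp.lastIndexOf('-');  // partition id follows the last dash
            byTopic.computeIfAbsent(tp.substring(0, dash), k -> new ArrayList<>())
                   .add(tp.substring(dash + 1));
        }
        return byTopic.entrySet().stream()
                .map(e -> e.getKey() + "-{" + String.join(",", e.getValue()) + "}")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(format(Arrays.asList("topic1-0", "topic1-2", "topic2-1")));
    }
}
```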


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hi Ted,

Yes, absolutely. Thanks for pointing that out!



-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ted Yu <yu...@gmail.com>.
For `uncleanable-partitions`, should the example include topic name(s) ?

Cheers


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ismael Juma <is...@juma.me.uk>.
On Thu, Aug 2, 2018 at 9:55 AM Colin McCabe <cm...@apache.org> wrote:

> On Wed, Aug 1, 2018, at 11:35, James Cheng wrote:
> > I’m a little confused about something. Is this KIP focused on log
> > cleaner exceptions in general, or focused on log cleaner exceptions due
> > to disk failures?
> >
> > Will max.uncleanable.partitions apply to all exceptions (including log
> > cleaner logic errors) or will it apply to only disk I/O exceptions?
>
> There is no difference between "log cleaner exceptions in general" and
> "log cleaner exceptions due to disk failures."
>
> For example, if the data on disk is corrupted we might read a 4-byte size
> as -1 instead of 100.  Then we would get a BufferUnderFlowException later
> on.  This is a subclass of RuntimeException rather than IOException, of
> course, but it does result from a disk problem.  Or we might get exceptions
> while validating checksums, which may or may not be IOE (I haven't looked).
>
> Of course, the log cleaner itself may have a bug, which results in it
> throwing an exception even if the disk does not have a problem.  We clearly
> want to fix these bugs.  But there's no way for the program itself to know
> that it has a bug and act differently.  If an exception occurs, we must
> assume there is a disk problem.


Hey Colin,

This is inconsistent with how we deal with disk failures outside of the log
cleaner. We should follow the same approach across the board so that we can
reason about how the system works. If we think the approach of using
specific exception types for disk related errors doesn't work, we should do
a KIP for that. For this KIP, I suggest we use the same approach we use to
mark disks as offline.
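The distinction Ismael is drawing could be sketched roughly as follows (a hypothetical handler, not the actual broker code; in the real broker, disk errors also surface as KafkaStorageException):

```java
import java.io.IOException;

public class CleanerErrorHandler {
    enum Action { MARK_DIR_OFFLINE, MARK_PARTITION_UNCLEANABLE }

    // Disk-related errors (IOException subclasses) take the whole log directory
    // offline, matching how the rest of the broker treats disk failure.
    // Anything else is presumed a cleaner logic bug and only quarantines the
    // one partition that triggered it.
    static Action handle(Throwable t) {
        if (t instanceof IOException) {
            return Action.MARK_DIR_OFFLINE;
        }
        return Action.MARK_PARTITION_UNCLEANABLE;
    }

    public static void main(String[] args) {
        System.out.println(handle(new IOException("disk error")));
        System.out.println(handle(new IllegalStateException("cleaner bug")));
    }
}
```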

Ismael

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Colin McCabe <cm...@apache.org>.
On Wed, Aug 1, 2018, at 11:35, James Cheng wrote:
> I’m a little confused about something. Is this KIP focused on log 
> cleaner exceptions in general, or focused on log cleaner exceptions due 
> to disk failures?
> 
> Will max.uncleanable.partitions apply to all exceptions (including log 
> cleaner logic errors) or will it apply to only disk I/O exceptions?

There is no difference between "log cleaner exceptions in general" and "log cleaner exceptions due to disk failures."

For example, if the data on disk is corrupted we might read a 4-byte size as -1 instead of 100.  Then we would get a BufferUnderFlowException later on.  This is a subclass of RuntimeException rather than IOException, of course, but it does result from a disk problem.  Or we might get exceptions while validating checksums, which may or may not be IOE (I haven't looked).

Of course, the log cleaner itself may have a bug, which results in it throwing an exception even if the disk does not have a problem.  We clearly want to fix these bugs.  But there's no way for the program itself to know that it has a bug and act differently.  If an exception occurs, we must assume there is a disk problem.
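Colin's corruption scenario can be reproduced in miniature: a size prefix that claims more bytes than the record actually has leads to a BufferUnderflowException far from the original disk error. This is a toy illustration, not Kafka's actual record format:

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

public class CorruptSizeDemo {
    public static void main(String[] args) {
        // A "record": a 4-byte size prefix followed by the payload. Corruption
        // has turned the size into 100 even though only 10 payload bytes follow.
        ByteBuffer buf = ByteBuffer.allocate(14);
        buf.putInt(100);
        buf.put(new byte[10]);
        buf.flip();

        int size = buf.getInt();           // reads the corrupted size: 100
        try {
            buf.get(new byte[size]);       // only 10 bytes remain in the buffer
        } catch (BufferUnderflowException e) {
            // A RuntimeException, not an IOException -- yet the root cause
            // is bad data on disk.
            System.out.println("BufferUnderflowException while reading "
                    + size + " bytes");
        }
    }
}
```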

> 
> I can understand taking the disk offline if there have been “N” I/O 
> exceptions. Disk errors are user fixable (by replacing the affected 
> disk). It turns an invisible (soft?) failure into a visible hard 
> failure. And the I/O exceptions are possibly already causing problems, 
> so it makes sense to limit their impact.
> 
> But I’m not sure if it makes sense to take a disk offline after “N” logic 
> errors in the log cleaner. If a log cleaner logic error happens, it’s 
> rarely user fixable. And it will likely affect several partitions at once, 
> so you’re likely to bump up against the max.uncleanable.partitions limit 
> more quickly. If a disk was taken offline due to logic errors, I’m not 
> sure what the user would do.

I don't agree that log cleaner bugs "will likely [affect] several partitions at once."  Most of the ones I've looked at only affect one or two partitions.  In particular the ones that resulted from over-eagerness to use 32-bit math on 64-bit values.

If the log cleaner is so buggy that it's useless (the scenario you're describing), and you want to put off an upgrade, then you can set max.uncleanable.partitions to the maximum value to ignore failures.

best,
Colin
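The 32-bit-math class of bug Colin mentions typically looks like this: an offset delta computed correctly in long arithmetic is then truncated by a cast to int. A contrived illustration, not code from the cleaner:

```java
public class OffsetMathDemo {
    // Buggy variant: the subtraction happens in long, but the cast silently
    // truncates the result to 32 bits for any delta above Integer.MAX_VALUE.
    static int buggyDelta(long baseOffset, long messageOffset) {
        return (int) (messageOffset - baseOffset);
    }

    static long correctDelta(long baseOffset, long messageOffset) {
        return messageOffset - baseOffset;
    }

    public static void main(String[] args) {
        long base = 0L;
        long offset = 3_000_000_000L;  // fits comfortably in a long
        System.out.println("buggy delta:   " + buggyDelta(base, offset));   // overflows to a negative int
        System.out.println("correct delta: " + correctDelta(base, offset));
    }
}
```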


> 
> -James
> 
> Sent from my iPhone
> 
> > On Aug 1, 2018, at 9:11 AM, Stanislav Kozlovski <st...@confluent.io> wrote:
> > 
> > Yes, good catch. Thank you, James!
> > 
> > Best,
> > Stanislav
> > 
> >> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com> wrote:
> >> 
> >> Can you update the KIP to say what the default is for
> >> max.uncleanable.partitions?
> >> 
> >> -James
> >> 
> >> Sent from my iPhone
> >> 
> >>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <st...@confluent.io>
> >> wrote:
> >>> 
> >>> Hey group,
> >>> 
> >>> I am planning on starting a voting thread tomorrow. Please do reply if
> >> you
> >>> feel there is anything left to discuss.
> >>> 
> >>> Best,
> >>> Stanislav
> >>> 
> >>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> >> stanislav@confluent.io>
> >>> wrote:
> >>> 
> >>>> Hey, Ray
> >>>> 
> >>>> Thanks for pointing that out, it's fixed now
> >>>> 
> >>>> Best,
> >>>> Stanislav
> >>>> 
> >>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org> wrote:
> >>>>> 
> >>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
> >>>>> the main KIP landing page
> >>>>> <
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
> >>> ?
> >>>>> 
> >>>>> I tried, but the Wiki won't let me.
> >>>>> 
> >>>>> -Ray
> >>>>> 
> >>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>>>> Hey guys,
> >>>>>> 
> >>>>>> @Colin - good point. I added some sentences mentioning recent
> >>>>> improvements
> >>>>>> in the introductory section.
> >>>>>> 
> >>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
> >>>>> fails,
> >>>>>> you don't want to work with it again. As such, I've changed my mind
> >> and
> >>>>>> believe that we should mark the LogDir (assume it's a disk) as offline
> >> on
> >>>>>> the first `IOException` encountered. This is the LogCleaner's current
> >>>>>> behavior. We shouldn't change that.
> >>>>>> 
> >>>>>> *Respawning Threads* - I believe we should never re-spawn a thread.
> >> The
> >>>>>> correct approach in my mind is to either have it stay dead or never
> >> let
> >>>>> it
> >>>>>> die in the first place.
> >>>>>> 
> >>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
> >>>>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
> >>>>> and
> >>>>>> inspect logs.
> >>>>>> 
> >>>>>> 
> >>>>>> Hey Ray,
> >>>>>> 
> >>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
> >>>>>>> skip over problematic partitions instead of dying.
> >>>>>> I think we can do this for every exception that isn't `IOException`.
> >>>>> This
> >>>>>> will future-proof us against bugs in the system and potential other
> >>>>> errors.
> >>>>>> Protecting yourself against unexpected failures is always a good thing
> >>>>> in
> >>>>>> my mind, but I also think that protecting yourself against bugs in the
> >>>>>> software is sort of clunky. What does everybody think about this?
> >>>>>> 
> >>>>>>> 4) The only improvement I can think of is that if such an
> >>>>>>> error occurs, then have the option (configuration setting?) to
> >> create a
> >>>>>>> <log_segment>.skip file (or something similar).
> >>>>>> This is a good suggestion. Have others also seen corruption be
> >> generally
> >>>>>> tied to the same segment?
> >>>>>> 
> >>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io>
> >>>>> wrote:
> >>>>>> 
> >>>>>>> For the cleaner thread specifically, I do not think respawning will
> >>>>> help at
> >>>>>>> all because we are more than likely to run into the same issue again
> >>>>> which
> >>>>>>> would end up crashing the cleaner. Retrying makes sense for transient
> >>>>>>> errors or when you believe some part of the system could have healed
> >>>>>>> itself, both of which I think are not true for the log cleaner.
> >>>>>>> 
> >>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
> >>>>> wrote:
> >>>>>>> 
> >>>>>>>> <<<respawning threads is likely to make things worse, by putting you
> >>>>> in
> >>>>>>> an
> >>>>>>>> infinite loop which consumes resources and fires off continuous log
> >>>>>>>> messages.
> >>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
> >>>>> effect
> >>>>>>> is
> >>>>>>>> to implement a backoff mechanism (if a second respawn is to occur
> >> then
> >>>>>>> wait
> >>>>>>>> for 1 minute before doing it; then if a third respawn is to occur
> >> wait
> >>>>>>> for
> >>>>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
> >> some
> >>>>> max
> >>>>>>>> wait time).
> >>>>>>>> 
> >>>>>>>> I have no opinion on whether respawn is appropriate or not in this
> >>>>>>> context,
> >>>>>>>> but a mitigation like the increasing backoff described above may be
> >>>>>>>> relevant in weighing the pros and cons.
> >>>>>>>> 
> >>>>>>>> Ron
> >>>>>>>> 
> >>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
> >>>>> wrote:
> >>>>>>>> 
> >>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>>> 
> >>>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant
> >>>>> of
> >>>>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
> >>>>>>>>>> 
> >>>>>>>>>> Things are better now than they used to be. We have the metric
> >>>>>>>>>>      kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>>>> which we can use to tell us if the threads are dead. And as of
> >>>>> 1.1.0,
> >>>>>>>> we
> >>>>>>>>>> have KIP-226, which allows you to restart the log cleaner thread,
> >>>>>>>>>> without requiring a broker restart.
> >>>>>>>>>> 
> >>>>>>> 
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>>>> <
> >>>>>>> 
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>>> 
> >>>>>>>>>> I've only read about this, I haven't personally tried it.
> >>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should probably
> >>>>>>> add a
> >>>>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the
> >> KIP.
> >>>>>>>> Maybe
> >>>>>>>>> in the intro section?
> >>>>>>>>> 
> >>>>>>>>> I think it's clear that requiring the users to manually restart the
> >>>>> log
> >>>>>>>>> cleaner is not a very good solution.  But it's good to know that
> >>>>> it's a
> >>>>>>>>> possibility on some older releases.
> >>>>>>>>> 
> >>>>>>>>>> Some comments:
> >>>>>>>>>> * I like the idea of having the log cleaner continue to clean as
> >>>>> many
> >>>>>>>>>> partitions as it can, skipping over the problematic ones if
> >>>>> possible.
> >>>>>>>>>> 
> >>>>>>>>>> * If the log cleaner thread dies, I think it should automatically
> >> be
> >>>>>>>>>> revived. Your KIP attempts to do that by catching exceptions
> >> during
> >>>>>>>>>> execution, but I think we should go all the way and make sure
> >> that a
> >>>>>>>> new
> >>>>>>>>>> one gets created, if the thread ever dies.
> >>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We
> >> don't
> >>>>>>>>> automatically re-create other threads in the broker if they
> >>>>> terminate.
> >>>>>>>> In
> >>>>>>>>> general, if there is a serious bug in the code, respawning threads
> >> is
> >>>>>>>>> likely to make things worse, by putting you in an infinite loop
> >> which
> >>>>>>>>> consumes resources and fires off continuous log messages.
> >>>>>>>>> 
> >>>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions.
> >>>>>>> I've
> >>>>>>>>>> seen cases where an uncleanable partition later became cleanable.
> >> I
> >>>>>>>>>> unfortunately don't remember how that happened, but I remember
> >> being
> >>>>>>>>>> surprised when I discovered it. It might have been something like
> >> a
> >>>>>>>>>> follower was uncleanable but after a leader election happened, the
> >>>>>>> log
> >>>>>>>>>> truncated and it was then cleanable again. I'm not sure.
> >>>>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed
> >>>>> File
> >>>>>>>>> System (HDFS) and it was a constant source of user problems.
> >>>>>>>>> 
> >>>>>>>>> What would happen is disks would just go bad over time.  The
> >> DataNode
> >>>>>>>>> would notice this and take them offline.  But then, due to some
> >>>>>>>>> "optimistic" code, the DataNode would periodically try to re-add
> >> them
> >>>>>>> to
> >>>>>>>>> the system.  Then one of two things would happen: the disk would
> >> just
> >>>>>>>> fail
> >>>>>>>>> immediately again, or it would appear to work and then fail after a
> >>>>>>> short
> >>>>>>>>> amount of time.
> >>>>>>>>> 
> >>>>>>>>> The way the disk failed was normally having an I/O request take a
> >>>>>>> really
> >>>>>>>>> long time and time out.  So a bunch of request handler threads
> >> would
> >>>>>>>>> basically slam into a brick wall when they tried to access the bad
> >>>>>>> disk,
> >>>>>>>>> slowing the DataNode to a crawl.  It was even worse in the second
> >>>>>>>> scenario,
> >>>>>>>>> if the disk appeared to work for a while, but then failed.  Any
> >> data
> >>>>>>> that
> >>>>>>>>> had been written on that DataNode to that disk would be lost, and
> >> we
> >>>>>>>> would
> >>>>>>>>> need to re-replicate it.
> >>>>>>>>> 
> >>>>>>>>> Disks aren't biological systems-- they don't heal over time.  Once
> >>>>>>>> they're
> >>>>>>>>> bad, they stay bad.  The log cleaner needs to be robust against
> >> cases
> >>>>>>>> where
> >>>>>>>>> the disk really is failing, and really is returning bad data or
> >>>>> timing
> >>>>>>>> out.
> >>>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style
> >>>>>>>>>> format, such as:
> >>>>>>>>>> 
> >>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >>>>>>>>>>              value=4
> >>>>>>>>>> 
> >>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very
> >>>>>>> long.
> >>>>>>>>>> I think the current max size is 210 characters (or maybe
> >> 240-ish?).
> >>>>>>>>>> Having the "uncleanable-partitions" being a list could be very
> >> large
> >>>>>>>>>> metric. Also, having the metric come out as a csv might be
> >> difficult
> >>>>>>> to
> >>>>>>>>>> work with for monitoring systems. If we *did* want the topic names
> >>>>> to
> >>>>>>>> be
> >>>>>>>>>> accessible, what do you think of having the
> >>>>>>>>>>      kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> >> example
> >>>>>>> was
> >>>>>>>>>> that the topic and partition can be tags in the metric. That will
> >>>>>>> allow
> >>>>>>>>>> monitoring systems to more easily slice and dice the metric. I'm
> >> not
> >>>>>>>>>> sure what the attribute for that metric would be. Maybe something
> >>>>>>> like
> >>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
> >>>>> time-since-last-clean?
> >>>>>>>> Or
> >>>>>>>>>> maybe even just "Value=1".
> >>>>>>>>> I haven't thought about this that hard, but do we really need the
> >>>>>>>>> uncleanable topic names to be accessible through a metric?  It
> >> seems
> >>>>>>> like
> >>>>>>>>> the admin should notice that uncleanable partitions are present,
> >> and
> >>>>>>> then
> >>>>>>>>> check the logs?
> >>>>>>>>> 
> >>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
> >>>>>>>>>> indicates that the disk is having problems. I'm not sure that is
> >> the
> >>>>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> >> problems,
> >>>>>>>> all
> >>>>>>>>>> of them are partition-level scenarios that happened during normal
> >>>>>>>>>> operation. None of them were indicative of disk problems.
> >>>>>>>>> I don't think this is a meaningful comparison.  In general, we
> >> don't
> >>>>>>>>> accept JIRAs for hard disk problems that happen on a particular
> >>>>>>> cluster.
> >>>>>>>>> If someone opened a JIRA that said "my hard disk is having
> >> problems"
> >>>>> we
> >>>>>>>>> could close that as "not a Kafka bug."  This doesn't prove that
> >> disk
> >>>>>>>>> problems don't happen, but  just that JIRA isn't the right place
> >> for
> >>>>>>>> them.
> >>>>>>>>> I do agree that the log cleaner has had a significant number of
> >> logic
> >>>>>>>>> bugs, and that we need to be careful to limit their impact.  That's
> >>>>> one
> >>>>>>>>> reason why I think that a threshold of "number of uncleanable logs"
> >>>>> is
> >>>>>>> a
> >>>>>>>>> good idea, rather than just failing after one IOException.  In all
> >>>>> the
> >>>>>>>>> cases I've seen where a user hit a logic bug in the log cleaner, it
> >>>>> was
> >>>>>>>>> just one partition that had the issue.  We also should increase
> >> test
> >>>>>>>>> coverage for the log cleaner.
> >>>>>>>>> 
> >>>>>>>>>> * About marking disks as offline when exceeding a certain
> >> threshold,
> >>>>>>>>>> that actually increases the blast radius of log compaction
> >> failures.
> >>>>>>>>>> Currently, the uncleaned partitions are still readable and
> >> writable.
> >>>>>>>>>> Taking the disks offline would impact availability of the
> >>>>> uncleanable
> >>>>>>>>>> partitions, as well as impact all other partitions that are on the
> >>>>>>>> disk.
> >>>>>>>>> In general, when we encounter I/O errors, we take the disk
> >> partition
> >>>>>>>>> offline.  This is spelled out in KIP-112 (
> >>>>>>>>> 
> >>>>>>> 
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> >>>>>>>>> ) :
> >>>>>>>>> 
> >>>>>>>>>> - Broker assumes a log directory to be good after it starts, and
> >>>>> mark
> >>>>>>>>> log directory as
> >>>>>>>>>> bad once there is IOException when broker attempts to access (i.e.
> >>>>>>> read
> >>>>>>>>> or write) the log directory.
> >>>>>>>>>> - Broker will be offline if all log directories are bad.
> >>>>>>>>>> - Broker will stop serving replicas in any bad log directory. New
> >>>>>>>>> replicas will only be created
> >>>>>>>>>> on good log directory.
> >>>>>>>>> The behavior Stanislav is proposing for the log cleaner is actually
> >>>>>>> more
> >>>>>>>>> optimistic than what we do for regular broker I/O, since we will
> >>>>>>> tolerate
> >>>>>>>>> multiple IOExceptions, not just one.  But it's generally
> >> consistent.
> >>>>>>>>> Ignoring errors is not.  In any case, if you want to tolerate an
> >>>>>>>> unlimited
> >>>>>>>>> number of I/O errors, you can just set the threshold to an infinite
> >>>>>>> value
> >>>>>>>>> (although I think that would be a bad idea).
> >>>>>>>>> 
> >>>>>>>>> best,
> >>>>>>>>> Colin
> >>>>>>>>> 
> >>>>>>>>>> -James
> >>>>>>>>>> 
> >>>>>>>>>> 

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ray Chiang <rc...@apache.org>.
I see this as a fix for the LogCleaner.  Uncaught exceptions kill the 
CleanerThread and this is viewed as undesired behavior.  Some other ways 
to think of this fix:

1) If you have occasional corruption in some log segments, then with 
each broker restart, the LogCleaner will lose its state, re-find all the 
corrupted log segments, and skip them in future runs.  In these cases, 
you will see a non-zero value for uncleanable-partitions-count and can 
look in the broker logs to see whether the corruption is fixable in some 
way; otherwise, the count will decrement once the corrupted log segment 
is no longer retained.

2) If you have increasing disk corruption, then this metric is a way to 
potentially catch it early.  It's not a perfect approach, but as we've 
discussed before, hard drive failures tend to cascade.  This is a useful 
side effect of log cleaning.

3) If you have a non-zero uncleanable-partitions-count, you can look in 
the logs, compare the replicated partitions across brokers, use 
DumpLogSegments to possibly find/fix/delete the corrupted record(s).  
Just from the cases I've seen, this type of corruption is fixable 
roughly 30% of the time.
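The count Ray describes is exposed over JMX; polling it in-process could look roughly like this. The MBean name follows the thread's discussion, but the gauge class and "Value" attribute are stand-ins for illustration (on a real broker, LogCleanerManager registers the metric itself):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class UncleanableCountProbe {
    // Stand-in gauge; the real broker registers its own metric under this name.
    public interface UncleanableGaugeMBean { int getValue(); }
    public static class UncleanableGauge implements UncleanableGaugeMBean {
        public int getValue() { return 4; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(
            "kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count");
        server.registerMBean(new UncleanableGauge(), name);  // done by the broker in reality

        // Monitoring side: read the gauge and alert when it is non-zero.
        int count = (Integer) server.getAttribute(name, "Value");
        if (count > 0) {
            System.out.println("uncleanable partitions: " + count
                    + " -- check the broker logs");
        }
    }
}
```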

-Ray


On 8/1/18 11:35 AM, James Cheng wrote:
> I’m a little confused about something. Is this KIP focused on log cleaner exceptions in general, or focused on log cleaner exceptions due to disk failures?
>
> Will max.uncleanable.partitions apply to all exceptions (including log cleaner logic errors) or will it apply to only disk I/O exceptions?
>
> I can understand taking the disk offline if there have been “N” I/O exceptions. Disk errors are user fixable (by replacing the affected disk). It turns an invisible (soft?) failure into a visible hard failure. And the I/O exceptions are possibly already causing problems, so it makes sense to limit their impact.
>
> But I’m not sure if it makes sense to take a disk offline after “N” logic errors in the log cleaner. If a log cleaner logic error happens, it’s rarely user fixable. And it will likely affect several partitions at once, so you’re likely to bump up against the max.uncleanable.partitions limit more quickly. If a disk was taken offline due to logic errors, I’m not sure what the user would do.
>
> -James
>
> Sent from my iPhone
>
>> On Aug 1, 2018, at 9:11 AM, Stanislav Kozlovski <st...@confluent.io> wrote:
>>
>> Yes, good catch. Thank you, James!
>>
>> Best,
>> Stanislav
>>
>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com> wrote:
>>>
>>> Can you update the KIP to say what the default is for
>>> max.uncleanable.partitions?
>>>
>>> -James
>>>
>>> Sent from my iPhone
>>>
>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <st...@confluent.io>
>>> wrote:
>>>> Hey group,
>>>>
>>>> I am planning on starting a voting thread tomorrow. Please do reply if
>>> you
>>>> feel there is anything left to discuss.
>>>>
>>>> Best,
>>>> Stanislav
>>>>
>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
>>> stanislav@confluent.io>
>>>> wrote:
>>>>
>>>>> Hey, Ray
>>>>>
>>>>> Thanks for pointing that out, it's fixed now
>>>>>
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org> wrote:
>>>>>>
>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
>>>>>> the main KIP landing page
>>>>>> <
>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
>>>> ?
>>>>>> I tried, but the Wiki won't let me.
>>>>>>
>>>>>> -Ray
>>>>>>
>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> @Colin - good point. I added some sentences mentioning recent
>>>>>> improvements
>>>>>>> in the introductory section.
>>>>>>>
>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
>>>>>> fails,
>>>>>>> you don't want to work with it again. As such, I've changed my mind
>>> and
>>>>>>> believe that we should mark the LogDir (assume it's a disk) as offline
>>> on
>>>>>>> the first `IOException` encountered. This is the LogCleaner's current
>>>>>>> behavior. We shouldn't change that.
>>>>>>>
>>>>>>> *Respawning Threads* - I believe we should never re-spawn a thread.
>>> The
>>>>>>> correct approach in my mind is to either have it stay dead or never
>>> let
>>>>>> it
>>>>>>> die in the first place.
>>>>>>>
>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
>>>>>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
>>>>>> and
>>>>>>> inspect logs.
>>>>>>>
>>>>>>>
>>>>>>> Hey Ray,
>>>>>>>
>>>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
>>>>>>>> skip over problematic partitions instead of dying.
>>>>>>> I think we can do this for every exception that isn't `IOException`.
>>>>>> This
>>>>>>> will future-proof us against bugs in the system and potential other
>>>>>> errors.
>>>>>>> Protecting yourself against unexpected failures is always a good thing
>>>>>> in
>>>>>>> my mind, but I also think that protecting yourself against bugs in the
>>>>>>> software is sort of clunky. What does everybody think about this?
>>>>>>>
>>>>>>>> 4) The only improvement I can think of is that if such an
>>>>>>>> error occurs, then have the option (configuration setting?) to
>>> create a
>>>>>>>> <log_segment>.skip file (or something similar).
>>>>>>> This is a good suggestion. Have others also seen corruption be
>>> generally
>>>>>>> tied to the same segment?
>>>>>>>
>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io>
>>>>>> wrote:
>>>>>>>> For the cleaner thread specifically, I do not think respawning will
>>>>>> help at
>>>>>>>> all because we are more than likely to run into the same issue again
>>>>>> which
>>>>>>>> would end up crashing the cleaner. Retrying makes sense for transient
>>>>>>>> errors or when you believe some part of the system could have healed
>>>>>>>> itself, both of which I think are not true for the log cleaner.
>>>>>>>>
>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
>>>>>> wrote:
>>>>>>>>> <<<respawning threads is likely to make things worse, by putting you
>>>>>> in
>>>>>>>> an
>>>>>>>>> infinite loop which consumes resources and fires off continuous log
>>>>>>>>> messages.
>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
>>>>>> effect
>>>>>>>> is
>>>>>>>>> to implement a backoff mechanism (if a second respawn is to occur
>>> then
>>>>>>>> wait
>>>>>>>>> for 1 minute before doing it; then if a third respawn is to occur
>>> wait
>>>>>>>> for
>>>>>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
>>> some
>>>>>> max
>>>>>>>>> wait time).
>>>>>>>>>
>>>>>>>>> I have no opinion on whether respawn is appropriate or not in this
>>>>>>>> context,
>>>>>>>>> but a mitigation like the increasing backoff described above may be
>>>>>>>>> relevant in weighing the pros and cons.
>>>>>>>>>
>>>>>>>>> Ron
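
[Editor's note: the capped exponential backoff Ron describes could be sketched as below. All names are hypothetical and illustrative only, not actual Kafka code.]

```java
// Sketch of a capped exponential respawn backoff: no wait on first start,
// then 1 min before the 2nd respawn, 2 min before the 3rd, doubling up to a cap.
public class RespawnBackoff {
    private static final long BASE_WAIT_MS = 60_000L;          // 1 minute
    private static final long MAX_WAIT_MS = 30 * 60_000L;      // cap at 30 minutes

    /** Wait before the Nth (re)spawn, where respawn = 1 is the initial start. */
    public static long backoffMs(int respawn) {
        if (respawn <= 1) return 0L;                           // initial start: no wait
        long wait = BASE_WAIT_MS << Math.min(respawn - 2, 20); // 1m, 2m, 4m, 8m, ...
        return Math.min(wait, MAX_WAIT_MS);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 7; i++)
            System.out.println("respawn " + i + " -> wait " + backoffMs(i) / 60_000 + " min");
    }
}
```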
>>>>>>>>>
>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
>>>>>> wrote:
>>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>>>>>>>
>>>>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant
>>>>>> of
>>>>>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
>>>>>>>>>>>
>>>>>>>>>>> Things are better now than they used to be. We have the metric
>>>>>>>>>>>       kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>>>>>>>>> which we can use to tell us if the threads are dead. And as of
>>>>>> 1.1.0,
>>>>>>>>> we
>>>>>>>>>>> have KIP-226, which allows you to restart the log cleaner thread,
>>>>>>>>>>> without requiring a broker restart.
>>>>>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>>>>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>>>>>> I've only read about this, I haven't personally tried it.
>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should probably
>>>>>>>> add a
>>>>>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the
>>> KIP.
>>>>>>>>> Maybe
>>>>>>>>>> in the intro section?
>>>>>>>>>>
>>>>>>>>>> I think it's clear that requiring the users to manually restart the
>>>>>> log
>>>>>>>>>> cleaner is not a very good solution.  But it's good to know that
>>>>>> it's a
>>>>>>>>>> possibility on some older releases.
>>>>>>>>>>
>>>>>>>>>>> Some comments:
>>>>>>>>>>> * I like the idea of having the log cleaner continue to clean as
>>>>>> many
>>>>>>>>>>> partitions as it can, skipping over the problematic ones if
>>>>>> possible.
>>>>>>>>>>> * If the log cleaner thread dies, I think it should automatically
>>> be
>>>>>>>>>>> revived. Your KIP attempts to do that by catching exceptions
>>> during
>>>>>>>>>>> execution, but I think we should go all the way and make sure
>>> that a
>>>>>>>>> new
>>>>>>>>>>> one gets created, if the thread ever dies.
>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We
>>> don't
>>>>>>>>>> automatically re-create other threads in the broker if they
>>>>>> terminate.
>>>>>>>>> In
>>>>>>>>>> general, if there is a serious bug in the code, respawning threads
>>> is
>>>>>>>>>> likely to make things worse, by putting you in an infinite loop
>>> which
>>>>>>>>>> consumes resources and fires off continuous log messages.
>>>>>>>>>>
>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions.
>>>>>>>> I've
>>>>>>>>>>> seen cases where an uncleanable partition later became cleanable.
>>> I
>>>>>>>>>>> unfortunately don't remember how that happened, but I remember
>>> being
>>>>>>>>>>> surprised when I discovered it. It might have been something like
>>> a
>>>>>>>>>>> follower was uncleanable but after a leader election happened, the
>>>>>>>> log
>>>>>>>>>>> truncated and it was then cleanable again. I'm not sure.
>>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed
>>>>>> File
>>>>>>>>>> System (HDFS) and it was a constant source of user problems.
>>>>>>>>>>
>>>>>>>>>> What would happen is disks would just go bad over time.  The
>>> DataNode
>>>>>>>>>> would notice this and take them offline.  But then, due to some
>>>>>>>>>> "optimistic" code, the DataNode would periodically try to re-add
>>> them
>>>>>>>> to
>>>>>>>>>> the system.  Then one of two things would happen: the disk would
>>> just
>>>>>>>>> fail
>>>>>>>>>> immediately again, or it would appear to work and then fail after a
>>>>>>>> short
>>>>>>>>>> amount of time.
>>>>>>>>>>
>>>>>>>>>> The way the disk failed was normally having an I/O request take a
>>>>>>>> really
>>>>>>>>>> long time and time out.  So a bunch of request handler threads
>>> would
>>>>>>>>>> basically slam into a brick wall when they tried to access the bad
>>>>>>>> disk,
>>>>>>>>>> slowing the DataNode to a crawl.  It was even worse in the second
>>>>>>>>> scenario,
>>>>>>>>>> if the disk appeared to work for a while, but then failed.  Any
>>> data
>>>>>>>> that
>>>>>>>>>> had been written on that DataNode to that disk would be lost, and
>>> we
>>>>>>>>> would
>>>>>>>>>> need to re-replicate it.
>>>>>>>>>>
>>>>>>>>>> Disks aren't biological systems-- they don't heal over time.  Once
>>>>>>>>> they're
>>>>>>>>>> bad, they stay bad.  The log cleaner needs to be robust against
>>> cases
>>>>>>>>> where
>>>>>>>>>> the disk really is failing, and really is returning bad data or
>>>>>> timing
>>>>>>>>> out.
>>>>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style
>>>>>>>>>>> format, such as:
>>>>>>>>>>>
>>>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>>>>>>>>>>>               value=4
>>>>>>>>>>>
>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very
>>>>>>>> long.
>>>>>>>>>>> I think the current max size is 210 characters (or maybe
>>> 240-ish?).
>>>>>>>>>>> Having the "uncleanable-partitions" being a list could be very
>>> large
>>>>>>>>>>> metric. Also, having the metric come out as a csv might be
>>> difficult
>>>>>>>> to
>>>>>>>>>>> work with for monitoring systems. If we *did* want the topic names
>>>>>> to
>>>>>>>>> be
>>>>>>>>>>> accessible, what do you think of having the
>>>>>>>>>>>       kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
>>> example
>>>>>>>> was
>>>>>>>>>>> that the topic and partition can be tags in the metric. That will
>>>>>>>> allow
>>>>>>>>>>> monitoring systems to more easily slice and dice the metric. I'm
>>> not
>>>>>>>>>>> sure what the attribute for that metric would be. Maybe something
>>>>>>>> like
>>>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
>>>>>> time-since-last-clean?
>>>>>>>>> Or
>>>>>>>>>>> maybe even just "Value=1".
>>>>>>>>>> I haven't thought about this that hard, but do we really need the
>>>>>>>>>> uncleanable topic names to be accessible through a metric?  It
>>> seems
>>>>>>>> like
>>>>>>>>>> the admin should notice that uncleanable partitions are present,
>>> and
>>>>>>>> then
>>>>>>>>>> check the logs?
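
[Editor's note: the JMX name James spelled out can be exercised with the standard `javax.management` API. The sketch below registers a stand-in gauge under that name and reads it back the way a monitoring agent would; the real broker wires its metrics up through its metrics library, so only the object name here is taken from the thread.]

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Registers a dummy gauge under the proposed metric name and reads it back,
// simulating what a JMX-based monitoring system would see on the broker.
public class UncleanableGaugeDemo {
    public interface GaugeMBean { Integer getValue(); }
    public static class Gauge implements GaugeMBean {
        public Integer getValue() { return 4; }  // e.g. 4 uncleanable partitions
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(
            "kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count");
        if (!server.isRegistered(name))
            server.registerMBean(new Gauge(), name);
        // What a monitoring agent would do: read the "Value" attribute.
        Object value = server.getAttribute(name, "Value");
        System.out.println(name + " Value=" + value);
    }
}
```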
>>>>>>>>>>
>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
>>>>>>>>>>> indicates that the disk is having problems. I'm not sure that is
>>> the
>>>>>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
>>> problems,
>>>>>>>>> all
>>>>>>>>>>> of them are partition-level scenarios that happened during normal
>>>>>>>>>>> operation. None of them were indicative of disk problems.
>>>>>>>>>> I don't think this is a meaningful comparison.  In general, we
>>> don't
>>>>>>>>>> accept JIRAs for hard disk problems that happen on a particular
>>>>>>>> cluster.
>>>>>>>>>> If someone opened a JIRA that said "my hard disk is having
>>> problems"
>>>>>> we
>>>>>>>>>> could close that as "not a Kafka bug."  This doesn't prove that
>>> disk
>>>>>>>>>> problems don't happen, but  just that JIRA isn't the right place
>>> for
>>>>>>>>> them.
>>>>>>>>>> I do agree that the log cleaner has had a significant number of
>>> logic
>>>>>>>>>> bugs, and that we need to be careful to limit their impact.  That's
>>>>>> one
>>>>>>>>>> reason why I think that a threshold of "number of uncleanable logs"
>>>>>> is
>>>>>>>> a
>>>>>>>>>> good idea, rather than just failing after one IOException.  In all
>>>>>> the
>>>>>>>>>> cases I've seen where a user hit a logic bug in the log cleaner, it
>>>>>> was
>>>>>>>>>> just one partition that had the issue.  We also should increase
>>> test
>>>>>>>>>> coverage for the log cleaner.
>>>>>>>>>>
>>>>>>>>>>> * About marking disks as offline when exceeding a certain
>>> threshold,
>>>>>>>>>>> that actually increases the blast radius of log compaction
>>> failures.
>>>>>>>>>>> Currently, the uncleaned partitions are still readable and
>>> writable.
>>>>>>>>>>> Taking the disks offline would impact availability of the
>>>>>> uncleanable
>>>>>>>>>>> partitions, as well as impact all other partitions that are on the
>>>>>>>>> disk.
>>>>>>>>>> In general, when we encounter I/O errors, we take the disk
>>> partition
>>>>>>>>>> offline.  This is spelled out in KIP-112 (
>>>>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>>>>>>>>>> ) :
>>>>>>>>>>
>>>>>>>>>>> - Broker assumes a log directory to be good after it starts, and
>>>>>> mark
>>>>>>>>>> log directory as
>>>>>>>>>>> bad once there is IOException when broker attempts to access (i.e.
>>>>>>>> read
>>>>>>>>>> or write) the log directory.
>>>>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>>>>> - Broker will stop serving replicas in any bad log directory. New
>>>>>>>>>> replicas will only be created
>>>>>>>>>>> on good log directory.
>>>>>>>>>> The behavior Stanislav is proposing for the log cleaner is actually
>>>>>>>> more
>>>>>>>>>> optimistic than what we do for regular broker I/O, since we will
>>>>>>>> tolerate
>>>>>>>>>> multiple IOExceptions, not just one.  But it's generally
>>> consistent.
>>>>>>>>>> Ignoring errors is not.  In any case, if you want to tolerate an
>>>>>>>>> unlimited
>>>>>>>>>> number of I/O errors, you can just set the threshold to an infinite
>>>>>>>> value
>>>>>>>>>> (although I think that would be a bad idea).
>>>>>>>>>>
>>>>>>>>>> best,
>>>>>>>>>> Colin
>>>>>>>>>>
>>>>>>>>>>> -James
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
>>>>>>>>>> stanislav@confluent.io> wrote:
>>>>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that.
>>> Here
>>>>>>>>> is
>>>>>>>>>> the
>>>>>>>>>>>> new link:
>>>>>>>>>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>>>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
>>>>>>>>>> stanislav@confluent.io>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey group,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I created a new KIP about making log compaction more
>>>>>>>> fault-tolerant.
>>>>>>>>>>>>> Please give it a look here and please share what you think,
>>>>>>>>>> especially in
>>>>>>>>>>>>> regards to the points in the "Needs Discussion" paragraph.
>>>>>>>>>>>>>
>>>>>>>>>>>>> KIP: KIP-346
>>>>>>>>>>>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stanislav
>>>>>>
>>>>> --
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>
>>>> --
>>>> Best,
>>>> Stanislav
>>
>> -- 
>> Best,
>> Stanislav


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by James Cheng <wu...@gmail.com>.
I’m a little confused about something. Is this KIP focused on log cleaner exceptions in general, or focused on log cleaner exceptions due to disk failures?

Will max.uncleanable.partitions apply to all exceptions (including log cleaner logic errors) or will it apply only to disk I/O exceptions?

I can understand taking the disk offline if there have been “N” I/O exceptions. Disk errors are user fixable (by replacing the affected disk). It turns an invisible (soft?) failure into a visible hard failure. And the I/O exceptions are possibly already causing problems, so it makes sense to limit their impact.

But I’m not sure if it makes sense to take a disk offline after “N” logic errors in the log cleaner. If a log cleaner logic error happens, it’s rarely user fixable. And it will likely affect several partitions at once, so you’re likely to bump up against the max.uncleanable.partitions limit more quickly. If a disk was taken offline due to logic errors, I’m not sure what the user would do.
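
[Editor's note: one way to picture the distinction James is asking about is to count only I/O errors toward a disk-offline threshold, while logic errors merely mark the individual partition uncleanable. This is a purely illustrative sketch; the class and method names are invented and the KIP does not prescribe this design.]

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical error policy: every failure makes the partition uncleanable
// (so the cleaner skips it), but only IOExceptions count as "disk suspect"
// and push toward taking the log dir offline.
public class CleanerErrorPolicy {
    private final int maxIoErrors;
    private final Set<String> uncleanable = new HashSet<>();
    private int ioErrors = 0;

    public CleanerErrorPolicy(int maxIoErrors) {
        this.maxIoErrors = maxIoErrors;
    }

    /** Returns true if the log dir should be marked offline. */
    public boolean onCleanFailure(String topicPartition, Exception e) {
        uncleanable.add(topicPartition);           // skip this partition from now on
        if (e instanceof IOException) ioErrors++;  // only I/O errors indict the disk
        return ioErrors >= maxIoErrors;
    }

    public Set<String> uncleanablePartitions() { return uncleanable; }

    public static void main(String[] args) {
        CleanerErrorPolicy policy = new CleanerErrorPolicy(2);
        System.out.println(policy.onCleanFailure("t1-0", new IllegalStateException()));
        System.out.println(policy.onCleanFailure("t1-1", new IOException()));
        System.out.println(policy.onCleanFailure("t1-2", new IOException()));
    }
}
```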

-James

Sent from my iPhone


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Jason Gustafson <ja...@confluent.io>.
>
> The issue with the __consumer_offsets topic is problematic, that is true.
> Nevertheless, I have some concerns with having a certain threshold of
> `uncleanable.bytes`. There is now a chance that a single error in a big
> partition (other than __consumer_offsets) marks the directory as offline
> outright. To avoid this, we would need to have it be set to *at least* half
> of the biggest compacted partition's size - this is since the default of
> `log.cleaner.min.cleanable.ratio` is 0.5. Even then, that single partition
> will quickly go over the threshold since it is not cleaned at all.


That's a fair point. I was thinking we could compute the size from the
dirty offset, but it's true that the uncleanable size could already be
large at the time of the failure.
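
[Editor's note: the sizing concern can be made concrete with the default log.cleaner.min.cleanable.ratio of 0.5: by the time cleaning first triggers, a partition may already carry half its size in dirty bytes, so any smaller uncleanable-bytes threshold trips on a single failure. A small worked example, with all sizes made up:]

```java
// Illustrates why an uncleanable.bytes threshold would need to be at least
// half the size of the biggest compacted partition under the default
// min.cleanable.ratio of 0.5.
public class ThresholdExample {
    /** Dirty bytes a partition can hold before it is even eligible for cleaning. */
    static long dirtyAtFirstClean(long partitionBytes, double minCleanableRatio) {
        return (long) (partitionBytes * minCleanableRatio);
    }

    public static void main(String[] args) {
        long partitionBytes = 100L << 30;                     // a 100 GiB compacted partition
        long dirty = dirtyAtFirstClean(partitionBytes, 0.5);  // 0.5 is Kafka's default ratio
        long thresholdBytes = 20L << 30;                      // a hypothetical 20 GiB threshold
        System.out.println("dirty bytes when cleaning first triggers: " + dirty);
        System.out.println("single failed clean exceeds threshold: " + (dirty > thresholdBytes));
    }
}
```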

I am now left with the conclusion that it's best to have that functionality
> be disabled by default. Since configs are relatively easy to add but hard
> to take away, I believe it might be best to drop off that functionality in
> this KIP. We could consider adding it later if the community believes it is
> needed.


Yeah, I think if we're not convinced we have the right solution, it might
be best to leave the config for potential future work. The main improvement
here is allowing the cleaner to continue in spite of failures on given
partitions.

Thanks,
Jason



On Wed, Aug 15, 2018 at 9:40 AM, Stanislav Kozlovski <stanislav@confluent.io
> wrote:

> Hi Jason,
>
> I was thinking about your suggestion. I agree that it makes sense to cap it
> at a certain threshold and it doesn't sound *too* restrictive to me either,
> considering the common case.
>
> The issue with the __consumer_offsets topic is problematic, that is true.
> Nevertheless, I have some concerns with having a certain threshold of
> `uncleanable.bytes`. There is now a chance that a single error in a big
> partition (other than __consumer_offsets) marks the directory as offline
> outright. To avoid this, we would need to have it be set to *at least* half
> of the biggest compacted partition's size - this is since the default of
> `log.cleaner.min.cleanable.ratio` is 0.5. Even then, that single partition
> will quickly go over the threshold since it is not cleaned at all.
>
> Ideally, we'd want to validate that more partitions are failing before
> marking the disk as offline to best ensure it is an actual disk problem.
> Having a threshold makes this tricky. Placing a reasonable default value
> seems very hard too, as it would either be too small (mark too fast) or too
> big (never mark offline) for some users, which would cause issues in the
> former case. Perhaps the best approach would be to have the functionality
> be disabled by default.
>
> I am now left with the conclusion that it's best to have that functionality
> be disabled by default. Since configs are relatively easy to add but hard
> to take away, I believe it might be best to drop off that functionality in
> this KIP. We could consider adding it later if the community believes it is
> needed.
> I consider that a reasonable approach, since the main perceived benefit of
> this KIP is the isolation of partition failures and to some extent the new
> metrics.
>
> What are other people's thoughts on this? I have updated the KIP
> accordingly.
>
> Best,
> Stanislav
>
> On Wed, Aug 15, 2018 at 12:27 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Sorry for the noise. Let me try again:
> >
> > My initial suggestion was to *track *the uncleanable disk space.
> > > I can see why marking a log directory as offline after a certain
> > threshold
> > > of uncleanable disk space is more useful.
> > > I'm not sure if we can set that threshold to be of certain size (e.g
> > 100GB)
> > > as log directories might have different sizes.  Maybe a percentage
> would
> > be
> > > better then (e.g 30% of whole log dir size), WDYT?
> >
> >
> > The two most common problems I am aware of when the log cleaner crashes
> are
> > 1) running out of disk space and 2) excessive coordinator loading time.
> The
> > problem in the latter case is that when the log cleaner is not running,
> the
> > __consumer_offsets topics can become huge. If there is a failure which
> > causes a coordinator change, then it can take a long time for the new
> > coordinator to load the offset cache since it reads from the beginning.
> > Consumers are effectively dead in the water when this happens since they
> > cannot commit offsets. We've seen coordinator loading times in the hours
> > for some users. If we could set a total cap on the uncleanable size, then
> > we can reduce the impact from unbounded __consumer_offsets growth.
> >
> > Also it's true that log directories may have different sizes, but I'm not
> > sure that is a common case. I don't think it would be too restrictive to
> > use a single max size for all directories. I think the key is just having
> > some way to cap the size of the uncleaned data.
> >
> > I feel it still makes sense to have a metric tracking how many
> uncleanable
> > > partitions there are and the total amount of uncleanable disk space
> (per
> > > log dir, via a JMX tag).
> > > But now, rather than fail the log directory after a certain count of
> > > uncleanable partitions, we could fail it after a certain percentage (or
> > > size) of its storage is uncleanable.
> >
> >
> > Yes, having the metric for uncleanable partitions could be useful. I was
> > mostly concerned about the corresponding config since it didn't seem to
> > address the main problems with the cleaner dying.
> >
> > Thanks,
> > Jason
> >
> > On Tue, Aug 14, 2018 at 4:11 PM, Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Stanislav, responses below:
> > >
> > > My initial suggestion was to *track *the uncleanable disk space.
> > >> I can see why marking a log directory as offline after a certain
> > threshold
> > >> of uncleanable disk space is more useful.
> > >> I'm not sure if we can set that threshold to be of certain size (e.g
> > >> 100GB)
> > >> as log directories might have different sizes.  Maybe a percentage
> would
> > >> be
> > >> better then (e.g 30% of whole log dir size), WDYT?
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Aug 10, 2018 at 2:05 AM, Stanislav Kozlovski <
> > > stanislav@confluent.io> wrote:
> > >
> > >> Hey Jason,
> > >>
> > >> My initial suggestion was to *track *the uncleanable disk space.
> > >> I can see why marking a log directory as offline after a certain
> > threshold
> > >> of uncleanable disk space is more useful.
> > >> I'm not sure if we can set that threshold to be of certain size (e.g
> > >> 100GB)
> > >> as log directories might have different sizes.  Maybe a percentage
> would
> > >> be
> > >> better then (e.g 30% of whole log dir size), WDYT?
> > >>
> > >> I feel it still makes sense to have a metric tracking how many
> > uncleanable
> > >> partitions there are and the total amount of uncleanable disk space
> (per
> > >> log dir, via a JMX tag).
> > >> But now, rather than fail the log directory after a certain count of
> > >> uncleanable partitions, we could fail it after a certain percentage
> (or
> > >> size) of its storage is uncleanable.
> > >>
> > >> I'd like to hear other people's thoughts on this. Sound good?
> > >>
> > >> Best,
> > >> Stanislav
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Aug 10, 2018 at 12:40 AM Jason Gustafson <ja...@confluent.io>
> > >> wrote:
> > >>
> > >> > Hey Stanislav,
> > >> >
> > >> > Sorry, I was probably looking at an older version (I had the tab
> open
> > >> for
> > >> > so long!).
> > >> >
> > >> > I have been thinking about `max.uncleanable.partitions` and
> wondering
> > if
> > >> > it's what we really want. The main risk if the cleaner cannot clean
> a
> > >> > partition is eventually running out of disk space. This is the most
> > >> common
> > >> > problem we have seen with cleaner failures and it can happen even if
> > >> there
> > >> > is just one uncleanable partition. We've actually seen cases in
> which
> > a
> > >> > single __consumer_offsets grew large enough to fill a significant
> > >> portion
> > >> > of the disk. The difficulty with allowing a system to run out of
> disk
> > >> space
> > >> > before failing is that it makes recovery difficult and time
> consuming.
> > >> > Clean shutdown, for example, requires writing some state to disk.
> > >> Without
> > >> > clean shutdown, it can take the broker significantly longer to
> startup
> > >> > because it has do more segment recovery.
> > >> >
> > >> > For this problem, `max.uncleanable.partitions` does not really help.
> > You
> > >> > can set it to 1 and fail fast, but that is not much better than the
> > >> > existing state. You had a suggestion previously in the KIP to use
> the
> > >> size
> > >> > of uncleanable disk space instead. What was the reason for rejecting
> > >> that?
> > >> > Intuitively, it seems like a better fit for a cleaner failure. It
> > would
> > >> > provide users some time to react to failures while still protecting
> > them
> > >> > from exhausting the disk.
> > >> >
> > >> > Thanks,
> > >> > Jason
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <
> > >> > stanislav@confluent.io>
> > >> > wrote:
> > >> >
> > >> > > Hey Jason,
> > >> > >
> > >> > > 1. *10* is the default value, it says so in the KIP
> > >> > > 2. This is a good catch. As the current implementation stands,
> it's
> > >> not a
> > >> > > useful metric since the thread continues to run even if all log
> > >> > directories
> > >> > > are offline (although I'm not sure what the broker's behavior is
> in
> > >> that
> > >> > > scenario). I'll make sure the thread stops if all log directories
> > are
> > >> > > online.
> > >> > >
> > >> > > I don't know which "Needs Discussion" item you're referencing,
> there
> > >> > hasn't
> > >> > > been any in the KIP since August 1 and that was for the metric
> only.
> > >> KIP
> > >> > > History
> > >> > > <https://cwiki.apache.org/confluence/pages/viewpreviousversi
> > >> ons.action?
> > >> > > pageId=89064875>
> > >> > >
> > >> > > I've updated the KIP to mention the "time-since-last-run" metric.
> > >> > >
> > >> > > Thanks,
> > >> > > Stanislav
> > >> > >
> > >> > > On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <
> jason@confluent.io
> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Stanislav,
> > >> > > >
> > >> > > > Just a couple quick questions:
> > >> > > >
> > >> > > > 1. I may have missed it, but what will be the default value for
> > >> > > > `max.uncleanable.partitions`?
> > >> > > > 2. It seems there will be some impact for users that monitoring
> > >> > > > "time-since-last-run-ms" in order to detect cleaner failures.
> Not
> > >> sure
> > >> > > it's
> > >> > > > a major concern, but probably worth mentioning in the
> > compatibility
> > >> > > > section. Also, is this still a useful metric after this KIP?
> > >> > > >
> > >> > > > Also, maybe the "Needs Discussion" item can be moved to rejected
> > >> > > > alternatives since you've moved to a vote? I think leaving this
> > for
> > >> > > > potential future work is reasonable.
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Jason
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rchiang@apache.org
> >
> > >> > wrote:
> > >> > > >
> > >> > > > > I'm okay with that.
> > >> > > > >
> > >> > > > > -Ray
> > >> > > > >
> > >> > > > > On 8/6/18 10:59 AM, Colin McCabe wrote:
> > >> > > > >
> > >> > > > >> Perhaps we could start with max.uncleanable.partitions and
> then
> > >> > > > implement
> > >> > > > >> max.uncleanable.partitions.per.logdir in a follow-up change
> if
> > >> it
> > >> > > seemed
> > >> > > > >> to be necessary?  What do you think?
> > >> > > > >>
> > >> > > > >> regards,
> > >> > > > >> Colin
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> > >> > > > >>
> > >> > > > >>> Hey Ray,
> > >> > > > >>>
> > >> > > > >>> Thanks for the explanation. In regards to the configuration
> > >> > property
> > >> > > -
> > >> > > > >>> I'm
> > >> > > > >>> not sure. As long as it has sufficient documentation, I find
> > >> > > > >>> "max.uncleanable.partitions" to be okay. If we were to add
> the
> > >> > > > >>> distinction
> > >> > > > >>> explicitly, maybe it should be `max.uncleanable.partitions.
> > >> > > per.logdir`
> > >> > > > ?
> > >> > > > >>>
> > >> > > > >>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <
> rchiang@apache.org
> > >
> > >> > > wrote:
> > >> > > > >>>
> > >> > > > >>> One more thing occurred to me.  Should the configuration
> > >> property
> > >> > be
> > >> > > > >>>> named "max.uncleanable.partitions.per.disk" instead?
> > >> > > > >>>>
> > >> > > > >>>> -Ray
> > >> > > > >>>>
> > >> > > > >>>>
> > >> > > > >>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > >> > > > >>>>
> > >> > > > >>>>> Yes, good catch. Thank you, James!
> > >> > > > >>>>>
> > >> > > > >>>>> Best,
> > >> > > > >>>>> Stanislav
> > >> > > > >>>>>
> > >> > > > >>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <
> > >> wushujames@gmail.com
> > >> > >
> > >> > > > >>>>> wrote:
> > >> > > > >>>>>
> > >> > > > >>>>> Can you update the KIP to say what the default is for
> > >> > > > >>>>>> max.uncleanable.partitions?
> > >> > > > >>>>>>
> > >> > > > >>>>>> -James
> > >> > > > >>>>>>
> > >> > > > >>>>>> Sent from my iPhone
> > >> > > > >>>>>>
> > >> > > > >>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <
> > >> > > > >>>>>>>
> > >> > > > >>>>>> stanislav@confluent.io>
> > >> > > > >>>>
> > >> > > > >>>>> wrote:
> > >> > > > >>>>>>
> > >> > > > >>>>>>> Hey group,
> > >> > > > >>>>>>>
> > >> > > > >>>>>>> I am planning on starting a voting thread tomorrow.
> Please
> > >> do
> > >> > > reply
> > >> > > > >>>>>>> if
> > >> > > > >>>>>>>
> > >> > > > >>>>>> you
> > >> > > > >>>>>>
> > >> > > > >>>>>>> feel there is anything left to discuss.
> > >> > > > >>>>>>>
> > >> > > > >>>>>>> Best,
> > >> > > > >>>>>>> Stanislav
> > >> > > > >>>>>>>
> > >> > > > >>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> > >> > > > >>>>>>>
> > >> > > > >>>>>> stanislav@confluent.io>
> > >> > > > >>>>>>
> > >> > > > >>>>>>> wrote:
> > >> > > > >>>>>>>
> > >> > > > >>>>>>> Hey, Ray
> > >> > > > >>>>>>>>
> > >> > > > >>>>>>>> Thanks for pointing that out, it's fixed now
> > >> > > > >>>>>>>>
> > >> > > > >>>>>>>> Best,
> > >> > > > >>>>>>>> Stanislav
> > >> > > > >>>>>>>>
> > >> > > > >>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <
> > >> > rchiang@apache.org>
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>> wrote:
> > >> > > > >>>>
> > >> > > > >>>>> Thanks.  Can you fix the link in the "KIPs under
> discussion"
> > >> > table
> > >> > > on
> > >> > > > >>>>>>>>> the main KIP landing page
> > >> > > > >>>>>>>>> <
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>
> > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+
> > >> > > > >>>> Improvement+Proposals#
> > >> > > > >>>>
> > >> > > > >>>>> ?
> > >> > > > >>>>>>>
> > >> > > > >>>>>>>> I tried, but the Wiki won't let me.
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>> -Ray
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > >> > > > >>>>>>>>>> Hey guys,
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>> @Colin - good point. I added some sentences
> mentioning
> > >> > recent
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> improvements
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> in the introductory section.
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>> *Disk Failure* - I tend to agree with what Colin
> said -
> > >> > once a
> > >> > > > >>>>>>>>>> disk
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> fails,
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> you don't want to work with it again. As such, I've
> > >> changed
> > >> > my
> > >> > > > >>>>>>>>>> mind
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> and
> > >> > > > >>>>>>
> > >> > > > >>>>>>> believe that we should mark the LogDir (assume its a
> disk)
> > >> as
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> offline
> > >> > > > >>>>
> > >> > > > >>>>> on
> > >> > > > >>>>>>
> > >> > > > >>>>>>> the first `IOException` encountered. This is the
> > >> LogCleaner's
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> current
> > >> > > > >>>>
> > >> > > > >>>>> behavior. We shouldn't change that.
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>> *Respawning Threads* - I believe we should never
> > >> re-spawn a
> > >> > > > >>>>>>>>>> thread.
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> The
> > >> > > > >>>>>>
> > >> > > > >>>>>>> correct approach in my mind is to either have it stay
> dead
> > >> or
> > >> > > never
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> let
> > >> > > > >>>>>>
> > >> > > > >>>>>>> it
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> die in the first place.
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>> *Uncleanable-partition-names metric* - Colin is
> right,
> > >> this
> > >> > > > metric
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> is
> > >> > > > >>>>
> > >> > > > >>>>> unneeded. Users can monitor the
> > `uncleanable-partitions-count`
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> metric
> > >> > > > >>>>
> > >> > > > >>>>> and
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> inspect logs.
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>> Hey Ray,
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>> 2) I'm 100% with James in agreement with setting up
> the
> > >> > > > LogCleaner
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> to
> > >> > > > >>>>
> > >> > > > >>>>> skip over problematic partitions instead of dying.
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> I think we can do this for every exception that isn't
> > >> > > > >>>>>>>>>> `IOException`.
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> This
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> will future-proof us against bugs in the system and
> > >> > potential
> > >> > > > >>>>>>>>>> other
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> errors.
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> Protecting yourself against unexpected failures is
> > >> always a
> > >> > > good
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> thing
> > >> > > > >>>>
> > >> > > > >>>>> in
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> my mind, but I also think that protecting yourself
> > >> against
> > >> > > bugs
> > >> > > > in
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> the
> > >> > > > >>>>
> > >> > > > >>>>> software is sort of clunky. What does everybody think
> about
> > >> this?
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>> 4) The only improvement I can think of is that if
> such
> > an
> > >> > > > >>>>>>>>>>> error occurs, then have the option (configuration
> > >> setting?)
> > >> > > to
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> create a
> > >> > > > >>>>>>
> > >> > > > >>>>>>> <log_segment>.skip file (or something similar).
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> This is a good suggestion. Have others also seen
> > >> corruption
> > >> > be
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> generally
> > >> > > > >>>>>>
> > >> > > > >>>>>>> tied to the same segment?
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <
> > >> > > > >>>>>>>>>> dhruvil@confluent.io
> > >> > > > >>>>>>>>>>
> > >> > > > >>>>>>>>> wrote:
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> For the cleaner thread specifically, I do not think
> > >> > respawning
> > >> > > > >>>>>>>>>>> will
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> help at
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> all because we are more than likely to run into the
> > same
> > >> > issue
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> again
> > >> > > > >>>>
> > >> > > > >>>>> which
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> would end up crashing the cleaner. Retrying makes
> sense
> > >> for
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> transient
> > >> > > > >>>>
> > >> > > > >>>>> errors or when you believe some part of the system could
> > have
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> healed
> > >> > > > >>>>
> > >> > > > >>>>> itself, both of which I think are not true for the log
> > >> cleaner.
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <
> > >> > > > >>>>>>>>>>> rndgstn@gmail.com>
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>> wrote:
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> <<<respawning threads is likely to make things worse,
> > by
> > >> > > putting
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> you
> > >> > > > >>>>
> > >> > > > >>>>> in
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> an
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> infinite loop which consumes resources and fires
> off
> > >> > > > continuous
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> log
> > >> > > > >>>>
> > >> > > > >>>>> messages.
> > >> > > > >>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to
> > >> > mitigate
> > >> > > > >>>>>>>>>>>> this
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> effect
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> is
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> to implement a backoff mechanism (if a second
> respawn
> > >> is
> > >> > to
> > >> > > > >>>>>>>>>>>> occur
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> then
> > >> > > > >>>>>>
> > >> > > > >>>>>>> wait
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> for 1 minute before doing it; then if a third
> respawn
> > >> is
> > >> > to
> > >> > > > >>>>>>>>>>>> occur
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> wait
> > >> > > > >>>>>>
> > >> > > > >>>>>>> for
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> 2 minutes before doing it; then 4 minutes, 8
> minutes,
> > >> etc.
> > >> > > up
> > >> > > > to
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> some
> > >> > > > >>>>>>
> > >> > > > >>>>>>> max
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> wait time).
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> I have no opinion on whether respawn is appropriate
> > or
> > >> not
> > >> > > in
> > >> > > > >>>>>>>>>>>> this
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> context,
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> but a mitigation like the increasing backoff
> > described
> > >> > above
> > >> > > > may
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> be
> > >> > > > >>>>
> > >> > > > >>>>> relevant in weighing the pros and cons.
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> Ron
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <
> > >> > > > >>>>>>>>>>>> cmccabe@apache.org>
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>> wrote:
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > >> > > > >>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>> I agree that it would be good if the LogCleaner
> > were
> > >> > more
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> tolerant
> > >> > > > >>>>
> > >> > > > >>>>> of
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> errors. Currently, as you said, once it dies, it
> stays
> > >> dead.
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>> Things are better now than they used to be. We
> have
> > >> the
> > >> > > > metric
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>> kafka.log:type=LogCleanerManager,name=time-
> > >> > > since-last-run-ms
> > >> > > > >>>>
> > >> > > > >>>>> which we can use to tell us if the threads are dead. And
> as
> > of
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> 1.1.0,
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> we
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> have KIP-226, which allows you to restart the log
> > >> cleaner
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> thread,
> > >> > > > >>>>
> > >> > > > >>>>> without requiring a broker restart.
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+
> > >> > > > >>>> Dynamic+Broker+Configuration
> > >> > > > >>>>
> > >> > > > >>>>> <
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> https://cwiki.apache.org/confl
> > >> uence/display/KAFKA/KIP-
> > >> > > 226+-+
> > >> > > > >>>> Dynamic+Broker+Configuration
> > >> > > > >>>>
> > >> > > > >>>>> I've only read about this, I haven't personally tried it.
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav,
> we
> > >> > should
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> probably
> > >> > > > >>>>
> > >> > > > >>>>> add a
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> sentence or two mentioning the KIP-226 changes
> > >> somewhere
> > >> > in
> > >> > > > the
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> KIP.
> > >> > > > >>>>>>
> > >> > > > >>>>>>> Maybe
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> in the intro section?
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> I think it's clear that requiring the users to
> > >> manually
> > >> > > > restart
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> the
> > >> > > > >>>>
> > >> > > > >>>>> log
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> cleaner is not a very good solution.  But it's good
> to
> > >> know
> > >> > > that
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> it's a
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> possibility on some older releases.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> Some comments:
> > >> > > > >>>>>>>>>>>>>> * I like the idea of having the log cleaner
> > continue
> > >> to
> > >> > > > clean
> > >> > > > >>>>>>>>>>>>>> as
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> many
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> partitions as it can, skipping over the problematic
> > ones
> > >> if
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> possible.
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> * If the log cleaner thread dies, I think it should
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> automatically
> > >> > > > >>>>
> > >> > > > >>>>> be
> > >> > > > >>>>>>
> > >> > > > >>>>>>> revived. Your KIP attempts to do that by catching
> > exceptions
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> during
> > >> > > > >>>>>>
> > >> > > > >>>>>>> execution, but I think we should go all the way and make
> > >> sure
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> that a
> > >> > > > >>>>>>
> > >> > > > >>>>>>> new
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> one gets created, if the thread ever dies.
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> This is inconsistent with the way the rest of
> Kafka
> > >> > works.
> > >> > > > We
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> don't
> > >> > > > >>>>>>
> > >> > > > >>>>>>> automatically re-create other threads in the broker if
> > they
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> terminate.
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> In
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> general, if there is a serious bug in the code,
> > >> > respawning
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> threads
> > >> > > > >>>>
> > >> > > > >>>>> is
> > >> > > > >>>>>>
> > >> > > > >>>>>>> likely to make things worse, by putting you in an
> infinite
> > >> loop
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> which
> > >> > > > >>>>>>
> > >> > > > >>>>>>> consumes resources and fires off continuous log
> messages.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> * It might be worth trying to re-clean the
> > uncleanable
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> partitions.
> > >> > > > >>>>
> > >> > > > >>>>> I've
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> seen cases where an uncleanable partition later
> > became
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> cleanable.
> > >> > > > >>>>
> > >> > > > >>>>> I
> > >> > > > >>>>>>
> > >> > > > >>>>>>> unfortunately don't remember how that happened, but I
> > >> remember
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> being
> > >> > > > >>>>>>
> > >> > > > >>>>>>> surprised when I discovered it. It might have been
> > something
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> like
> > >> > > > >>>>
> > >> > > > >>>>> a
> > >> > > > >>>>>>
> > >> > > > >>>>>>> follower was uncleanable but after a leader election
> > >> happened,
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> the
> > >> > > > >>>>
> > >> > > > >>>>> log
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> truncated and it was then cleanable again. I'm not
> > >> sure.
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> James, I disagree.  We had this behavior in the
> > Hadoop
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> Distributed
> > >> > > > >>>>
> > >> > > > >>>>> File
> > >> > > > >>>>>>>>>
> > >> > > > >>>>>>>>>> System (HDFS) and it was a constant source of user
> > >> problems.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> What would happen is disks would just go bad over
> > >> time.
> > >> > > The
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> DataNode
> > >> > > > >>>>>>
> > >> > > > >>>>>>> would notice this and take them offline.  But then, due
> to
> > >> some
> > >> > > > >>>>>>>>>>>>> "optimistic" code, the DataNode would periodically
> > >> try to
> > >> > > > >>>>>>>>>>>>> re-add
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> them
> > >> > > > >>>>>>
> > >> > > > >>>>>>> to
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> the system.  Then one of two things would happen:
> the
> > >> disk
> > >> > > > would
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> just
> > >> > > > >>>>>>
> > >> > > > >>>>>>> fail
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> immediately again, or it would appear to work and
> > then
> > >> > fail
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> after a
> > >> > > > >>>>
> > >> > > > >>>>> short
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> amount of time.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> The way the disk failed was normally having an I/O
> > >> > request
> > >> > > > >>>>>>>>>>>>> take a
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> really
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> long time and time out.  So a bunch of request
> > handler
> > >> > > threads
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> would
> > >> > > > >>>>>>
> > >> > > > >>>>>>> basically slam into a brick wall when they tried to
> access
> > >> the
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> bad
> > >> > > > >>>>
> > >> > > > >>>>> disk,
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> slowing the DataNode to a crawl.  It was even worse
> > in
> > >> the
> > >> > > > >>>>>>>>>>>>> second
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> scenario,
> > >> > > > >>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> if the disk appeared to work for a while, but then
> > >> > failed.
> > >> > > > Any
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> data
> > >> > > > >>>>>>
> > >> > > > >>>>>>> that
> > >> > > > >>>>>>>>>>>
> > >> > > > >>>>>>>>>>>> had been written on that DataNode to that disk
> would
> > be
> > >> > > > >>>>>>>>>>>>>> lost, and we would need to re-replicate it.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> Disks aren't biological systems-- they don't heal over time.
> > >> > > > >>>>>>>>>>>>> Once they're bad, they stay bad.  The log cleaner needs to be
> > >> > > > >>>>>>>>>>>>> robust against cases where the disk really is failing, and
> > >> > > > >>>>>>>>>>>>> really is returning bad data or timing out.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>> * For your metrics, can you spell out the full metric in
> > >> > > > >>>>>>>>>>>>>> JMX-style format, such as:
> > >> > > > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> > >> > > > >>>>>>>>>>>>>>     value=4
> > >> > > > >>>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be
> > >> > > > >>>>>>>>>>>>>> very long. I think the current max size is 210 characters (or
> > >> > > > >>>>>>>>>>>>>> maybe 240-ish?). Having the "uncleanable-partitions" being a
> > >> > > > >>>>>>>>>>>>>> list could be a very large metric. Also, having the metric
> > >> > > > >>>>>>>>>>>>>> come out as a csv might be difficult to work with for
> > >> > > > >>>>>>>>>>>>>> monitoring systems. If we *did* want the topic names to be
> > >> > > > >>>>>>>>>>>>>> accessible, what do you think of having the
> > >> > > > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > >> > > > >>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> > >> > > > >>>>>>>>>>>>>> example was that the topic and partition can be tags in the
> > >> > > > >>>>>>>>>>>>>> metric. That will allow monitoring systems to more easily
> > >> > > > >>>>>>>>>>>>>> slice and dice the metric. I'm not sure what the attribute
> > >> > > > >>>>>>>>>>>>>> for that metric would be. Maybe something like "uncleaned
> > >> > > > >>>>>>>>>>>>>> bytes" for that topic-partition? Or time-since-last-clean?
> > >> > > > >>>>>>>>>>>>>> Or maybe even just "Value=1".
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> I haven't thought about this that hard, but do we really need
> > >> > > > >>>>>>>>>>>>> the uncleanable topic names to be accessible through a
> > >> > > > >>>>>>>>>>>>> metric?  It seems like the admin should notice that
> > >> > > > >>>>>>>>>>>>> uncleanable partitions are present, and then check the logs?
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this
> > >> > > > >>>>>>>>>>>>>> likely indicates that the disk is having problems. I'm not
> > >> > > > >>>>>>>>>>>>>> sure that is the case. For the 4 JIRAs that you mentioned
> > >> > > > >>>>>>>>>>>>>> about log cleaner problems, all of them are partition-level
> > >> > > > >>>>>>>>>>>>>> scenarios that happened during normal operation. None of
> > >> > > > >>>>>>>>>>>>>> them were indicative of disk problems.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> I don't think this is a meaningful comparison.  In general,
> > >> > > > >>>>>>>>>>>>> we don't accept JIRAs for hard disk problems that happen on a
> > >> > > > >>>>>>>>>>>>> particular cluster.  If someone opened a JIRA that said "my
> > >> > > > >>>>>>>>>>>>> hard disk is having problems" we could close that as "not a
> > >> > > > >>>>>>>>>>>>> Kafka bug."  This doesn't prove that disk problems don't
> > >> > > > >>>>>>>>>>>>> happen, but just that JIRA isn't the right place for them.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> I do agree that the log cleaner has had a significant number
> > >> > > > >>>>>>>>>>>>> of logic bugs, and that we need to be careful to limit their
> > >> > > > >>>>>>>>>>>>> impact.  That's one reason why I think that a threshold of
> > >> > > > >>>>>>>>>>>>> "number of uncleanable logs" is a good idea, rather than just
> > >> > > > >>>>>>>>>>>>> failing after one IOException.  In all the cases I've seen
> > >> > > > >>>>>>>>>>>>> where a user hit a logic bug in the log cleaner, it was just
> > >> > > > >>>>>>>>>>>>> one partition that had the issue.  We also should increase
> > >> > > > >>>>>>>>>>>>> test coverage for the log cleaner.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>> * About marking disks as offline when exceeding a certain
> > >> > > > >>>>>>>>>>>>>> threshold, that actually increases the blast radius of log
> > >> > > > >>>>>>>>>>>>>> compaction failures. Currently, the uncleaned partitions are
> > >> > > > >>>>>>>>>>>>>> still readable and writable. Taking the disks offline would
> > >> > > > >>>>>>>>>>>>>> impact availability of the uncleanable partitions, as well
> > >> > > > >>>>>>>>>>>>>> as impact all other partitions that are on the disk.
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>> In general, when we encounter I/O errors, we take the disk
> > >> > > > >>>>>>>>>>>>> partition offline.  This is spelled out in KIP-112
> > >> > > > >>>>>>>>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> > >> > > > >>>>>>>>>>>>>
> > >> > > > >>>>>>>>>>>>>> - Broker assumes a log directory to be good after it starts,
> > >> > > > >>>>>>>>>>>>>> and mark log directory as bad once there is IOException when
> > >> > > > >>>>>>>>>>>>>> broker attempts to access (i.e. read or write) the log
> > >> > > > >>>>>>>>>>>>>> directory.
> > >> > > > >>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
> > >> > > > >>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log
> > >> > > > >>>>>>>>>>>>>> directory. New
> >
> > --
> > Best,
> > Stanislav
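[Editor's aside: James's metric question above contrasts one big CSV-valued attribute with per-partition metrics that carry topic and partition as JMX tags. The difference is easy to see in miniature. This is a hedged sketch only -- Python standing in for the broker's actual metrics code, and the object-name layout mirrors James's example rather than the KIP's final metric design.]

```python
def tagged_metric(type_, name, **tags):
    """Format a JMX-style object name whose key properties include
    arbitrary tags (e.g. topic and partition)."""
    props = {"type": type_, "name": name, **tags}
    return "kafka.log:" + ",".join(f"{k}={v}" for k, v in props.items())

def parse_tags(object_name):
    """Recover the key properties from a JMX-style object name."""
    _, _, props = object_name.partition(":")
    return dict(p.split("=", 1) for p in props.split(","))

# One metric per uncleanable partition, with topic/partition as tags a
# monitoring system can slice on -- instead of one huge CSV-valued value.
m = tagged_metric("LogCleanerManager", "uncleanable-partitions",
                  topic="topic1", partition=2)
print(m)
# kafka.log:type=LogCleanerManager,name=uncleanable-partitions,topic=topic1,partition=2
```

The tag form sidesteps both problems raised above: there is no single attribute to overflow with long topic names, and monitoring systems get structured keys instead of a CSV to parse.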

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hi Jason,

I was thinking about your suggestion. I agree that it makes sense to cap it
at a certain threshold and it doesn't sound *too* restrictive to me either,
considering the common case.

The issue with the __consumer_offsets topic is problematic, that is true.
Nevertheless, I have some concerns with having a certain threshold of
`uncleanable.bytes`. There is now a chance that a single error in a big
partition (other than __consumer_offsets) marks the directory as offline
outright. To avoid this, we would need to have it be set to *at least* half
of the biggest compacted partition's size - this is since the default of
`log.cleaner.min.cleanable.ratio` is 0.5. Even then, that single partition
will quickly go over the threshold since it is not cleaned at all.

Ideally, we'd want to validate that more partitions are failing before
marking the disk as offline to best ensure it is an actual disk problem.
Having a threshold makes this tricky. Placing a reasonable default value
seems very hard too, as it would either be too small (mark too fast) or too
big (never mark offline) for some users, which would cause issues in the
former case. Perhaps the best approach would be to have the functionality
be disabled by default.

Then again, since configs are relatively easy to add but hard to take
away, I believe it might be best to drop that functionality from this KIP
altogether. We could consider adding it later if the community believes it
is needed.
I consider that a reasonable approach, since the main perceived benefit of
this KIP is the isolation of partition failures and to some extent the new
metrics.

What are other people's thoughts on this? I have updated the KIP
accordingly.

Best,
Stanislav

On Wed, Aug 15, 2018 at 12:27 AM Jason Gustafson <ja...@confluent.io> wrote:

> Sorry for the noise. Let me try again:
>
> My initial suggestion was to *track *the uncleanable disk space.
> > I can see why marking a log directory as offline after a certain
> threshold
> > of uncleanable disk space is more useful.
> > I'm not sure if we can set that threshold to be of certain size (e.g
> 100GB)
> > as log directories might have different sizes.  Maybe a percentage would
> be
> > better then (e.g 30% of whole log dir size), WDYT?
>
>
> The two most common problems I am aware of when the log cleaner crashes are
> 1) running out of disk space and 2) excessive coordinator loading time. The
> problem in the latter case is that when the log cleaner is not running, the
> __consumer_offsets topics can become huge. If there is a failure which
> causes a coordinator change, then it can take a long time for the new
> coordinator to load the offset cache since it reads from the beginning.
> Consumers are effectively dead in the water when this happens since they
> cannot commit offsets. We've seen coordinator loading times in the hours
> for some users. If we could set a total cap on the uncleanable size, then
> we can reduce the impact from unbounded __consumer_offsets growth.
>
> Also it's true that log directories may have different sizes, but I'm not
> sure that is a common case. I don't think it would be too restrictive to
> use a single max size for all directories. I think the key is just having
> some way to cap the size of the uncleaned data.
>
> I feel it still makes sense to have a metric tracking how many uncleanable
> > partitions there are and the total amount of uncleanable disk space (per
> > log dir, via a JMX tag).
> > But now, rather than fail the log directory after a certain count of
> > uncleanable partitions, we could fail it after a certain percentage (or
> > size) of its storage is uncleanable.
>
>
> Yes, having the metric for uncleanable partitions could be useful. I was
> mostly concerned about the corresponding config since it didn't seem to
> address the main problems with the cleaner dying.
>
> Thanks,
> Jason
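[Editor's aside: to make Jason's size-based cap concrete, here is a minimal sketch of such a check, expressed as a fraction of the log directory so that differently sized directories behave consistently. The function name, the 30% default, and the None-to-disable convention are all assumptions for illustration, not settled KIP configuration.]

```python
def should_mark_logdir_offline(uncleanable_bytes, logdir_total_bytes,
                               max_uncleanable_ratio=0.30):
    """Return True once uncleanable data exceeds the configured share of
    the log directory's total size. A ratio of None disables the check
    entirely (the 'disabled by default' option discussed in the thread)."""
    if max_uncleanable_ratio is None or logdir_total_bytes <= 0:
        return False
    return uncleanable_bytes / logdir_total_bytes > max_uncleanable_ratio

print(should_mark_logdir_offline(40 * 2**30, 100 * 2**30))  # True
print(should_mark_logdir_offline(10 * 2**30, 100 * 2**30))  # False
```

A byte-based (rather than count-based) trigger is what addresses the disk-exhaustion and __consumer_offsets scenarios Jason describes: one huge uncleanable partition trips it, while many tiny ones do not.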
>
> On Tue, Aug 14, 2018 at 4:11 PM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Stanislav, responses below:
> >
> > My initial suggestion was to *track *the uncleanable disk space.
> >> I can see why marking a log directory as offline after a certain
> threshold
> >> of uncleanable disk space is more useful.
> >> I'm not sure if we can set that threshold to be of certain size (e.g
> >> 100GB)
> >> as log directories might have different sizes.  Maybe a percentage would
> >> be
> >> better then (e.g 30% of whole log dir size), WDYT?
> >
> >
> >
> >
> >
> > On Fri, Aug 10, 2018 at 2:05 AM, Stanislav Kozlovski <
> > stanislav@confluent.io> wrote:
> >
> >> Hey Jason,
> >>
> >> My initial suggestion was to *track *the uncleanable disk space.
> >> I can see why marking a log directory as offline after a certain
> threshold
> >> of uncleanable disk space is more useful.
> >> I'm not sure if we can set that threshold to be of certain size (e.g
> >> 100GB)
> >> as log directories might have different sizes.  Maybe a percentage would
> >> be
> >> better then (e.g 30% of whole log dir size), WDYT?
> >>
> >> I feel it still makes sense to have a metric tracking how many
> uncleanable
> >> partitions there are and the total amount of uncleanable disk space (per
> >> log dir, via a JMX tag).
> >> But now, rather than fail the log directory after a certain count of
> >> uncleanable partitions, we could fail it after a certain percentage (or
> >> size) of its storage is uncleanable.
> >>
> >> I'd like to hear other people's thoughts on this. Sound good?
> >>
> >> Best,
> >> Stanislav
> >>
> >>
> >>
> >>
> >> On Fri, Aug 10, 2018 at 12:40 AM Jason Gustafson <ja...@confluent.io>
> >> wrote:
> >>
> >> > Hey Stanislav,
> >> >
> >> > Sorry, I was probably looking at an older version (I had the tab open
> >> for
> >> > so long!).
> >> >
> >> > I have been thinking about `max.uncleanable.partitions` and wondering
> if
> >> > it's what we really want. The main risk if the cleaner cannot clean a
> >> > partition is eventually running out of disk space. This is the most
> >> common
> >> > problem we have seen with cleaner failures and it can happen even if
> >> there
> >> > is just one uncleanable partition. We've actually seen cases in which
> a
> >> > single __consumer_offsets grew large enough to fill a significant
> >> portion
> >> > of the disk. The difficulty with allowing a system to run out of disk
> >> space
> >> > before failing is that it makes recovery difficult and time consuming.
> >> > Clean shutdown, for example, requires writing some state to disk.
> >> > Without clean shutdown, it can take the broker significantly longer to
> >> > start up because it has to do more segment recovery.
> >> >
> >> > For this problem, `max.uncleanable.partitions` does not really help.
> You
> >> > can set it to 1 and fail fast, but that is not much better than the
> >> > existing state. You had a suggestion previously in the KIP to use the
> >> size
> >> > of uncleanable disk space instead. What was the reason for rejecting
> >> that?
> >> > Intuitively, it seems like a better fit for a cleaner failure. It
> would
> >> > provide users some time to react to failures while still protecting
> them
> >> > from exhausting the disk.
> >> >
> >> > Thanks,
> >> > Jason
> >> >
> >> >
> >> >
> >> >
> >> > On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <
> >> > stanislav@confluent.io>
> >> > wrote:
> >> >
> >> > > Hey Jason,
> >> > >
> >> > > 1. *10* is the default value; it says so in the KIP.
> >> > > 2. This is a good catch. As the current implementation stands, it's not
> >> > > a useful metric since the thread continues to run even if all log
> >> > > directories are offline (although I'm not sure what the broker's
> >> > > behavior is in that scenario). I'll make sure the thread stops if all
> >> > > log directories are offline.
> >> > >
> >> > > I don't know which "Needs Discussion" item you're referencing; there
> >> > > hasn't been one in the KIP since August 1, and that was for the metric
> >> > > only. KIP History:
> >> > > <https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=89064875>
> >> > >
> >> > > I've updated the KIP to mention the "time-since-last-run" metric.
> >> > >
> >> > > Thanks,
> >> > > Stanislav
> >> > >
> >> > > On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <jason@confluent.io> wrote:
> >> > >
> >> > > > Hi Stanislav,
> >> > > >
> >> > > > Just a couple quick questions:
> >> > > >
> >> > > > 1. I may have missed it, but what will be the default value for
> >> > > > `max.uncleanable.partitions`?
> >> > > > 2. It seems there will be some impact for users that monitoring
> >> > > > "time-since-last-run-ms" in order to detect cleaner failures. Not
> >> sure
> >> > > it's
> >> > > > a major concern, but probably worth mentioning in the
> compatibility
> >> > > > section. Also, is this still a useful metric after this KIP?
> >> > > >
> >> > > > Also, maybe the "Needs Discussion" item can be moved to rejected
> >> > > > alternatives since you've moved to a vote? I think leaving this
> for
> >> > > > potential future work is reasonable.
> >> > > >
> >> > > > Thanks,
> >> > > > Jason
> >> > > >
> >> > > >
> >> > > > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rc...@apache.org>
> >> > wrote:
> >> > > >
> >> > > > > I'm okay with that.
> >> > > > >
> >> > > > > -Ray
> >> > > > >
> >> > > > > On 8/6/18 10:59 AM, Colin McCabe wrote:
> >> > > > >
> >> > > > >> Perhaps we could start with max.uncleanable.partitions and then
> >> > > > implement
> >> > > > >> max.uncleanable.partitions.per.logdir in a follow-up change if
> >> it
> >> > > seemed
> >> > > > >> to be necessary?  What do you think?
> >> > > > >>
> >> > > > >> regards,
> >> > > > >> Colin
> >> > > > >>
> >> > > > >>
> >> > > > >> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> >> > > > >>
> >> > > > >>> Hey Ray,
> >> > > > >>>
> >> > > > >>> Thanks for the explanation. In regards to the configuration
> >> > > > >>> property - I'm not sure. As long as it has sufficient
> >> > > > >>> documentation, I find "max.uncleanable.partitions" to be okay. If
> >> > > > >>> we were to add the distinction explicitly, maybe it should be
> >> > > > >>> `max.uncleanable.partitions.per.logdir`?
> >> > > > >>>
> >> > > > >>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchiang@apache.org> wrote:
> >> > > > >>>
> >> > > > >>>> One more thing occurred to me.  Should the configuration property
> >> > > > >>>> be named "max.uncleanable.partitions.per.disk" instead?
> >> > > > >>>>
> >> > > > >>>> -Ray
> >> > > > >>>>
> >> > > > >>>>
> >> > > > >>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> >> > > > >>>>
> >> > > > >>>>> Yes, good catch. Thank you, James!
> >> > > > >>>>>
> >> > > > >>>>> Best,
> >> > > > >>>>> Stanislav
> >> > > > >>>>>
> >> > > > >>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushujames@gmail.com> wrote:
> >> > > > >>>>>
> >> > > > >>>>> Can you update the KIP to say what the default is for
> >> > > > >>>>>> max.uncleanable.partitions?
> >> > > > >>>>>>
> >> > > > >>>>>> -James
> >> > > > >>>>>>
> >> > > > >>>>>> Sent from my iPhone
> >> > > > >>>>>>
> >> > > > >>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:
> >> > > > >>>>>>
> >> > > > >>>>>>> Hey group,
> >> > > > >>>>>>>
> >> > > > >>>>>>> I am planning on starting a voting thread tomorrow. Please do
> >> > > > >>>>>>> reply if you feel there is anything left to discuss.
> >> > > > >>>>>>>
> >> > > > >>>>>>> Best,
> >> > > > >>>>>>> Stanislav
> >> > > > >>>>>>>
> >> > > > >>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanislav@confluent.io> wrote:
> >> > > > >>>>>>>
> >> > > > >>>>>>>> Hey, Ray
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> Thanks for pointing that out, it's fixed now
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> Best,
> >> > > > >>>>>>>> Stanislav
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org> wrote:
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion"
> >> > > > >>>>>>>>> table on the main KIP landing page
> >> > > > >>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> >> > > > >>>>>>>>> I tried, but the Wiki won't let me.
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> -Ray
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >> > > > >>>>>>>>>> Hey guys,
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> @Colin - good point. I added some sentences mentioning recent
> >> > > > >>>>>>>>>> improvements in the introductory section.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a
> >> > > > >>>>>>>>>> disk fails, you don't want to work with it again. As such, I've
> >> > > > >>>>>>>>>> changed my mind and believe that we should mark the LogDir
> >> > > > >>>>>>>>>> (assume it's a disk) as offline on the first `IOException`
> >> > > > >>>>>>>>>> encountered. This is the LogCleaner's current behavior. We
> >> > > > >>>>>>>>>> shouldn't change that.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a
> >> > > > >>>>>>>>>> thread. The correct approach in my mind is to either have it
> >> > > > >>>>>>>>>> stay dead or never let it die in the first place.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this
> >> > > > >>>>>>>>>> metric is unneeded. Users can monitor the
> >> > > > >>>>>>>>>> `uncleanable-partitions-count` metric and inspect logs.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> Hey Ray,
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the
> >> > > > >>>>>>>>>>> LogCleaner to skip over problematic partitions instead of
> >> > > > >>>>>>>>>>> dying.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> I think we can do this for every exception that isn't
> >> > > > >>>>>>>>>> `IOException`. This will future-proof us against bugs in the
> >> > > > >>>>>>>>>> system and potential other errors. Protecting yourself against
> >> > > > >>>>>>>>>> unexpected failures is always a good thing in my mind, but I
> >> > > > >>>>>>>>>> also think that protecting yourself against bugs in the
> >> > > > >>>>>>>>>> software is sort of clunky. What does everybody think about
> >> > > > >>>>>>>>>> this?
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>>> 4) The only improvement I can think of is that if such an
> >> > > > >>>>>>>>>>> error occurs, then have the option (configuration setting?)
> >> > > > >>>>>>>>>>> to create a <log_segment>.skip file (or something similar).
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> This is a good suggestion. Have others also seen corruption be
> >> > > > >>>>>>>>>> generally tied to the same segment?
> >> > > > >>>>>>>>>>
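[Editor's aside: Ray's `<log_segment>.skip` idea above amounts to a marker file the cleaner would consult before attempting each segment. A small sketch of the mechanics, purely illustrative -- no such mechanism exists in Kafka, and the file naming is hypothetical.]

```python
from pathlib import Path
import tempfile

def skip_marker(segment: Path) -> Path:
    """The sibling marker that flags a segment as permanently uncleanable."""
    return Path(str(segment) + ".skip")

def mark_segment_skipped(segment: Path) -> None:
    """Drop the marker, e.g. after a corrupt-record error on this segment."""
    skip_marker(segment).touch()

def segment_is_skipped(segment: Path) -> bool:
    """The cleaner would check this before attempting to clean a segment."""
    return skip_marker(segment).exists()

with tempfile.TemporaryDirectory() as d:
    seg = Path(d) / "00000000000000000000.log"
    seg.touch()
    print(segment_is_skipped(seg))   # False
    mark_segment_skipped(seg)
    print(segment_is_skipped(seg))   # True
```

One appeal of a marker file over in-memory state is that the decision survives broker restarts, which matters if corruption really is tied to the same segment each time.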
> >> > > > >>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io> wrote:
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>>> For the cleaner thread specifically, I do not think
> >> > > > >>>>>>>>>>> respawning will help at all because we are more than likely
> >> > > > >>>>>>>>>>> to run into the same issue again which would end up crashing
> >> > > > >>>>>>>>>>> the cleaner. Retrying makes sense for transient errors or
> >> > > > >>>>>>>>>>> when you believe some part of the system could have healed
> >> > > > >>>>>>>>>>> itself, both of which I think are not true for the log
> >> > > > >>>>>>>>>>> cleaner.
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com> wrote:
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> <<<respawning threads is likely to make things worse, by
> >> > > > >>>>>>>>>>>> putting you in an infinite loop which consumes resources and
> >> > > > >>>>>>>>>>>> fires off continuous log messages.
> >> > > > >>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate
> >> > > > >>>>>>>>>>>> this effect is to implement a backoff mechanism (if a second
> >> > > > >>>>>>>>>>>> respawn is to occur then wait for 1 minute before doing it;
> >> > > > >>>>>>>>>>>> then if a third respawn is to occur wait for 2 minutes before
> >> > > > >>>>>>>>>>>> doing it; then 4 minutes, 8 minutes, etc. up to some max wait
> >> > > > >>>>>>>>>>>> time).
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not
> >> > > > >>>>>>>>>>>> in this context, but a mitigation like the increasing backoff
> >> > > > >>>>>>>>>>>> described above may be relevant in weighing the pros and
> >> > > > >>>>>>>>>>>> cons.
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> Ron
> >> > > > >>>>>>>>>>>>
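[Editor's aside: the doubling-backoff scheme Ron describes above (1 minute, then 2, 4, 8, ... up to some cap) is simple to pin down. A minimal sketch; the one-minute base and the cap are exactly the knobs he leaves open.]

```python
def respawn_delays(base_seconds=60, max_seconds=1800):
    """Yield the wait before each successive respawn attempt:
    base, 2*base, 4*base, ... capped at max_seconds."""
    delay = base_seconds
    while True:
        yield min(delay, max_seconds)
        delay *= 2

gen = respawn_delays(base_seconds=60, max_seconds=480)
print([next(gen) for _ in range(6)])  # [60, 120, 240, 480, 480, 480]
```

Whether respawning is appropriate at all is the open question in the thread; the backoff only bounds the resource churn and log noise if it is.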
> >> > > > >>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org> wrote:
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >> > > > >>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more
> >> > > > >>>>>>>>>>>>>> tolerant of errors. Currently, as you said, once it dies,
> >> > > > >>>>>>>>>>>>>> it stays dead.
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>> Things are better now than they used to be. We have the
> >> > > > >>>>>>>>>>>>>> metric
> >> > > > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >> > > > >>>>>>>>>>>>>> which we can use to tell us if the threads are dead. And
> >> > > > >>>>>>>>>>>>>> as of 1.1.0, we have KIP-226, which allows you to restart
> >> > > > >>>>>>>>>>>>>> the log cleaner thread, without requiring a broker restart.
> >> > > > >>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration>
> >> > > > >>>>>>>>>>>>>> I've only read about this, I haven't personally tried it.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
> >> > > > >>>>>>>>>>>>> probably add a sentence or two mentioning the KIP-226
> >> > > > >>>>>>>>>>>>> changes somewhere in the KIP.  Maybe in the intro section?
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> I think it's clear that requiring the users to manually
> >> > > > >>>>>>>>>>>>> restart the log cleaner is not a very good solution.  But
> >> > > > >>>>>>>>>>>>> it's good to know that it's a possibility on some older
> >> > > > >>>>>>>>>>>>> releases.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>> Some comments:
> >> > > > >>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to
> >> > > > >>>>>>>>>>>>>> clean as many partitions as it can, skipping over the
> >> > > > >>>>>>>>>>>>>> problematic ones if possible.
> >> > > > >>>>>>>>>>>>>> * If the log cleaner thread dies, I think it should
> >> > > > >>>>>>>>>>>>>> automatically be revived. Your KIP attempts to do that by
> >> > > > >>>>>>>>>>>>>> catching exceptions during execution, but I think we
> >> > > > >>>>>>>>>>>>>> should go all the way and make sure that a new one gets
> >> > > > >>>>>>>>>>>>>> created, if the thread ever dies.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.
> >> > > > >>>>>>>>>>>>> We don't automatically re-create other threads in the
> >> > > > >>>>>>>>>>>>> broker if they terminate.  In general, if there is a
> >> > > > >>>>>>>>>>>>> serious bug in the code, respawning threads is likely to
> >> > > > >>>>>>>>>>>>> make things worse, by putting you in an infinite loop which
> >> > > > >>>>>>>>>>>>> consumes resources and fires off continuous log messages.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable
> >> > > > >>>>>>>>>>>>>> partitions. I've seen cases where an uncleanable partition
> >> > > > >>>>>>>>>>>>>> later became cleanable. I unfortunately don't remember how
> >> > > > >>>>>>>>>>>>>> that happened, but I remember being surprised when I
> >> > > > >>>>>>>>>>>>>> discovered it. It might have been something like a
> >> > > > >>>>>>>>>>>>>> follower was uncleanable but after a leader election
> >> > > > >>>>>>>>>>>>>> happened, the log
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> truncated and it was then cleanable again. I'm not
> >> sure.
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> James, I disagree.  We had this behavior in the
> Hadoop
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> Distributed
> >> > > > >>>>
> >> > > > >>>>> File
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> System (HDFS) and it was a constant source of user
> >> problems.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> What would happen is disks would just go bad over
> >> time.
> >> > > The
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> DataNode
> >> > > > >>>>>>
> >> > > > >>>>>>> would notice this and take them offline.  But then, due to
> >> some
> >> > > > >>>>>>>>>>>>> "optimistic" code, the DataNode would periodically
> >> try to
> >> > > > >>>>>>>>>>>>> re-add
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> them
> >> > > > >>>>>>
> >> > > > >>>>>>> to
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> the system.  Then one of two things would happen: the
> >> disk
> >> > > > would
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> just
> >> > > > >>>>>>
> >> > > > >>>>>>> fail
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> immediately again, or it would appear to work and
> then
> >> > fail
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> after a
> >> > > > >>>>
> >> > > > >>>>> short
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> amount of time.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> The way the disk failed was normally having an I/O
> >> > request
> >> > > > >>>>>>>>>>>>> take a
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> really
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> long time and time out.  So a bunch of request
> handler
> >> > > threads
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> would
> >> > > > >>>>>>
> >> > > > >>>>>>> basically slam into a brick wall when they tried to access
> >> the
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> bad
> >> > > > >>>>
> >> > > > >>>>> disk,
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> slowing the DataNode to a crawl.  It was even worse
> in
> >> the
> >> > > > >>>>>>>>>>>>> second
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> scenario,
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> if the disk appeared to work for a while, but then
> >> > failed.
> >> > > > Any
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> data
> >> > > > >>>>>>
> >> > > > >>>>>>> that
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> had been written on that DataNode to that disk would
> be
> >> > > lost,
> >> > > > >>>>>>>>>>>>> and
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> we
> >> > > > >>>>>>
> >> > > > >>>>>>> would
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> need to re-replicate it.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> Disks aren't biological systems-- they don't heal
> over
> >> > > time.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> Once
> >> > > > >>>>
> >> > > > >>>>> they're
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> bad, they stay bad.  The log cleaner needs to be
> >> robust
> >> > > > against
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> cases
> >> > > > >>>>>>
> >> > > > >>>>>>> where
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> the disk really is failing, and really is returning
> >> bad
> >> > > data
> >> > > > or
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> timing
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> out.
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> * For your metrics, can you spell out the full
> metric
> >> in
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> JMX-style
> >> > > > >>>>
> >> > > > >>>>> format, such as:
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>>
>  kafka.log:type=LogCleanerManager,name=uncleanable-
> >> > > > >>>> partitions-count
> >> > > > >>>>
> >> > > > >>>>>                 value=4
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition
> names
> >> > can
> >> > > be
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> very
> >> > > > >>>>
> >> > > > >>>>> long.
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> I think the current max size is 210 characters (or
> >> maybe
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> 240-ish?).
> >> > > > >>>>>>
> >> > > > >>>>>>> Having the "uncleanable-partitions" being a list could be
> >> very
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> large
> >> > > > >>>>>>
> >> > > > >>>>>>> metric. Also, having the metric come out as a csv might be
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> difficult
> >> > > > >>>>>>
> >> > > > >>>>>>> to
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> work with for monitoring systems. If we *did* want
> the
> >> > topic
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> names
> >> > > > >>>>
> >> > > > >>>>> to
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> be
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> accessible, what do you think of having the
> >> > > > >>>>>>>>>>>>>>         kafka.log:type=LogCleanerManag
> >> > > > >>>>>>>>>>>>>> er,topic=topic1,partition=2
> >> > > > >>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right
> type,
> >> but
> >> > > my
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> example
> >> > > > >>>>>>
> >> > > > >>>>>>> was
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> that the topic and partition can be tags in the
> metric.
> >> > That
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> will
> >> > > > >>>>
> >> > > > >>>>> allow
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> monitoring systems to more easily slice and dice the
> >> > metric.
> >> > > > I'm
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> not
> >> > > > >>>>>>
> >> > > > >>>>>>> sure what the attribute for that metric would be. Maybe
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> something
> >> > > > >>>>
> >> > > > >>>>> like
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> time-since-last-clean?
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> Or
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> maybe even just "Value=1".
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> I haven't though about this that hard, but do we
> >> really
> >> > > need
> >> > > > >>>>>>>>>>>>> the
> >> > > > >>>>>>>>>>>>> uncleanable topic names to be accessible through a
> >> > metric?
> >> > > > It
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> seems
> >> > > > >>>>>>
> >> > > > >>>>>>> like
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> the admin should notice that uncleanable partitions
> are
> >> > > > present,
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> and
> >> > > > >>>>>>
> >> > > > >>>>>>> then
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> check the logs?
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that
> >> this
> >> > > > likely
> >> > > > >>>>>>>>>>>>>> indicates that the disk is having problems. I'm not
> >> sure
> >> > > > that
> >> > > > >>>>>>>>>>>>>> is
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> the
> >> > > > >>>>>>
> >> > > > >>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> problems,
> >> > > > >>>>>>
> >> > > > >>>>>>> all
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> of them are partition-level scenarios that happened
> >> > during
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> normal
> >> > > > >>>>
> >> > > > >>>>> operation. None of them were indicative of disk problems.
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> I don't think this is a meaningful comparison.  In
> >> > general,
> >> > > > we
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> don't
> >> > > > >>>>>>
> >> > > > >>>>>>> accept JIRAs for hard disk problems that happen on a
> >> particular
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> cluster.
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> If someone opened a JIRA that said "my hard disk is
> >> having
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> problems"
> >> > > > >>>>>>
> >> > > > >>>>>>> we
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> could close that as "not a Kafka bug."  This doesn't
> >> prove
> >> > > that
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> disk
> >> > > > >>>>>>
> >> > > > >>>>>>> problems don't happen, but  just that JIRA isn't the right
> >> > place
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> for
> >> > > > >>>>>>
> >> > > > >>>>>>> them.
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> I do agree that the log cleaner has had a
> significant
> >> > > number
> >> > > > of
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> logic
> >> > > > >>>>>>
> >> > > > >>>>>>> bugs, and that we need to be careful to limit their
> impact.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> That's
> >> > > > >>>>
> >> > > > >>>>> one
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> reason why I think that a threshold of "number of
> >> > uncleanable
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> logs"
> >> > > > >>>>
> >> > > > >>>>> is
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> a
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> good idea, rather than just failing after one
> >> IOException.
> >> > > In
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> all
> >> > > > >>>>
> >> > > > >>>>> the
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> cases I've seen where a user hit a logic bug in the log
> >> > > cleaner,
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> it
> >> > > > >>>>
> >> > > > >>>>> was
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> just one partition that had the issue.  We also should
> >> > > increase
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> test
> >> > > > >>>>>>
> >> > > > >>>>>>> coverage for the log cleaner.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> * About marking disks as offline when exceeding a
> >> certain
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> threshold,
> >> > > > >>>>>>
> >> > > > >>>>>>> that actually increases the blast radius of log compaction
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> failures.
> >> > > > >>>>>>
> >> > > > >>>>>>> Currently, the uncleaned partitions are still readable and
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> writable.
> >> > > > >>>>>>
> >> > > > >>>>>>> Taking the disks offline would impact availability of the
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> uncleanable
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> partitions, as well as impact all other partitions that
> >> are
> >> > on
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> the
> >> > > > >>>>
> >> > > > >>>>> disk.
> >> > > > >>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> In general, when we encounter I/O errors, we take
> the
> >> > disk
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>> partition
> >> > > > >>>>>>
> >> > > > >>>>>>> offline.  This is spelled out in KIP-112 (
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>
> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%
> >> > > > >>>> 3A+Handle+disk+failure+for+JBOD
> >> > > > >>>>
> >> > > > >>>>> ) :
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> - Broker assumes a log directory to be good after it
> >> > > starts,
> >> > > > >>>>>>>>>>>>>> and
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> mark
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>> log directory as
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>> bad once there is IOException when broker attempts
> to
> >> > > access
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> (i.e.
> >> > > > >>>>
> >> > > > >>>>> read
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>> or write) the log directory.
> >> > > > >>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>>> - Broker will be offline if all log directories are
> >> bad.
> >> > > > >>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log
> >> > > > directory.
> >> > > > >>>>>>>>>>>>>>
> >> > > > >>>>>>>>>>>>> New
> >> > > > >>>
> >> > > > >>>
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Best,
> >> > > Stanislav
> >> > >
> >> >
> >>
> >>
> >> --
> >> Best,
> >> Stanislav
> >>
> >
> >
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Jason Gustafson <ja...@confluent.io>.
Sorry for the noise. Let me try again:

My initial suggestion was to *track *the uncleanable disk space.
> I can see why marking a log directory as offline after a certain threshold
> of uncleanable disk space is more useful.
> I'm not sure if we can set that threshold to be of certain size (e.g 100GB)
> as log directories might have different sizes.  Maybe a percentage would be
> better then (e.g 30% of whole log dir size), WDYT?


The two most common problems I am aware of when the log cleaner crashes are
1) running out of disk space and 2) excessive coordinator loading time. The
problem in the latter case is that when the log cleaner is not running, the
__consumer_offsets topics can become huge. If there is a failure which
causes a coordinator change, then it can take a long time for the new
coordinator to load the offset cache since it reads from the beginning.
Consumers are effectively dead in the water when this happens since they
cannot commit offsets. We've seen coordinator loading times in the hours
for some users. If we could set a total cap on the uncleanable size, then
we can reduce the impact from unbounded __consumer_offsets growth.

Also it's true that log directories may have different sizes, but I'm not
sure that is a common case. I don't think it would be too restrictive to
use a single max size for all directories. I think the key is just having
some way to cap the size of the uncleaned data.

I feel it still makes sense to have a metric tracking how many uncleanable
> partitions there are and the total amount of uncleanable disk space (per
> log dir, via a JMX tag).
> But now, rather than fail the log directory after a certain count of
> uncleanable partitions, we could fail it after a certain percentage (or
> size) of its storage is uncleanable.


Yes, having the metric for uncleanable partitions could be useful. I was
mostly concerned about the corresponding config since it didn't seem to
address the main problems with the cleaner dying.

Thanks,
Jason

On Tue, Aug 14, 2018 at 4:11 PM, Jason Gustafson <ja...@confluent.io> wrote:

> Hey Stanislav, responses below:
>
> My initial suggestion was to *track *the uncleanable disk space.
>> I can see why marking a log directory as offline after a certain threshold
>> of uncleanable disk space is more useful.
>> I'm not sure if we can set that threshold to be of certain size (e.g
>> 100GB)
>> as log directories might have different sizes.  Maybe a percentage would
>> be
>> better then (e.g 30% of whole log dir size), WDYT?
>
>
>
>
>
> On Fri, Aug 10, 2018 at 2:05 AM, Stanislav Kozlovski <
> stanislav@confluent.io> wrote:
>
>> Hey Jason,
>>
>> My initial suggestion was to *track *the uncleanable disk space.
>> I can see why marking a log directory as offline after a certain threshold
>> of uncleanable disk space is more useful.
>> I'm not sure if we can set that threshold to be of certain size (e.g
>> 100GB)
>> as log directories might have different sizes.  Maybe a percentage would
>> be
>> better then (e.g 30% of whole log dir size), WDYT?
>>
>> I feel it still makes sense to have a metric tracking how many uncleanable
>> partitions there are and the total amount of uncleanable disk space (per
>> log dir, via a JMX tag).
>> But now, rather than fail the log directory after a certain count of
>> uncleanable partitions, we could fail it after a certain percentage (or
>> size) of its storage is uncleanable.
>>
>> I'd like to hear other people's thoughts on this. Sound good?
>>
>> Best,
>> Stanislav
>>
>>
>>
>>
>> On Fri, Aug 10, 2018 at 12:40 AM Jason Gustafson <ja...@confluent.io>
>> wrote:
>>
>> > Hey Stanislav,
>> >
>> > Sorry, I was probably looking at an older version (I had the tab open
>> for
>> > so long!).
>> >
>> > I have been thinking about `max.uncleanable.partitions` and wondering if
>> > it's what we really want. The main risk if the cleaner cannot clean a
>> > partition is eventually running out of disk space. This is the most
>> common
>> > problem we have seen with cleaner failures and it can happen even if
>> there
>> > is just one uncleanable partition. We've actually seen cases in which a
>> > single __consumer_offsets grew large enough to fill a significant
>> portion
>> > of the disk. The difficulty with allowing a system to run out of disk
>> space
>> > before failing is that it makes recovery difficult and time consuming.
>> > Clean shutdown, for example, requires writing some state to disk.
>> Without
>> > clean shutdown, it can take the broker significantly longer to startup
>> > because it has do more segment recovery.
>> >
>> > For this problem, `max.uncleanable.partitions` does not really help. You
>> > can set it to 1 and fail fast, but that is not much better than the
>> > existing state. You had a suggestion previously in the KIP to use the
>> size
>> > of uncleanable disk space instead. What was the reason for rejecting
>> that?
>> > Intuitively, it seems like a better fit for a cleaner failure. It would
>> > provide users some time to react to failures while still protecting them
>> > from exhausting the disk.
>> >
>> > Thanks,
>> > Jason
>> >
>> >
>> >
>> >
>> > On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <
>> > stanislav@confluent.io>
>> > wrote:
>> >
>> > > Hey Jason,
>> > >
>> > > 1. *10* is the default value, it says so in the KIP
>> > > 2. This is a good catch. As the current implementation stands, it's
>> not a
>> > > useful metric since the thread continues to run even if all log
>> > directories
>> > > are offline (although I'm not sure what the broker's behavior is in
>> that
>> > > scenario). I'll make sure the thread stops if all log directories are
>> > > online.
>> > >
>> > > I don't know which "Needs Discussion" item you're referencing, there
>> > hasn't
>> > > been any in the KIP since August 1 and that was for the metric only.
>> KIP
>> > > History
>> > > <https://cwiki.apache.org/confluence/pages/viewpreviousversi
>> ons.action?
>> > > pageId=89064875>
>> > >
>> > > I've updated the KIP to mention the "time-since-last-run" metric.
>> > >
>> > > Thanks,
>> > > Stanislav
>> > >
>> > > On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <ja...@confluent.io>
>> > > wrote:
>> > >
>> > > > Hi Stanislav,
>> > > >
>> > > > Just a couple quick questions:
>> > > >
>> > > > 1. I may have missed it, but what will be the default value for
>> > > > `max.uncleanable.partitions`?
>> > > > 2. It seems there will be some impact for users that monitoring
>> > > > "time-since-last-run-ms" in order to detect cleaner failures. Not
>> sure
>> > > it's
>> > > > a major concern, but probably worth mentioning in the compatibility
>> > > > section. Also, is this still a useful metric after this KIP?
>> > > >
>> > > > Also, maybe the "Needs Discussion" item can be moved to rejected
>> > > > alternatives since you've moved to a vote? I think leaving this for
>> > > > potential future work is reasonable.
>> > > >
>> > > > Thanks,
>> > > > Jason
>> > > >
>> > > >
>> > > > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rc...@apache.org>
>> > wrote:
>> > > >
>> > > > > I'm okay with that.
>> > > > >
>> > > > > -Ray
>> > > > >
>> > > > > On 8/6/18 10:59 AM, Colin McCabe wrote:
>> > > > >
>> > > > >> Perhaps we could start with max.uncleanable.partitions and then
>> > > > implement
>> > > > >> max.uncleanable.partitions.per.logdir in a follow-up change if
>> it
>> > > seemed
>> > > > >> to be necessary?  What do you think?
>> > > > >>
>> > > > >> regards,
>> > > > >> Colin
>> > > > >>
>> > > > >>
>> > > > >> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
>> > > > >>
>> > > > >>> Hey Ray,
>> > > > >>>
>> > > > >>> Thanks for the explanation. In regards to the configuration
>> > property
>> > > -
>> > > > >>> I'm
>> > > > >>> not sure. As long as it has sufficient documentation, I find
>> > > > >>> "max.uncleanable.partitions" to be okay. If we were to add the
>> > > > >>> distinction
>> > > > >>> explicitly, maybe it should be `max.uncleanable.partitions.
>> > > per.logdir`
>> > > > ?
>> > > > >>>
>> > > > >>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org>
>> > > wrote:
>> > > > >>>
>> > > > >>> One more thing occurred to me.  Should the configuration
>> property
>> > be
>> > > > >>>> named "max.uncleanable.partitions.per.disk" instead?
>> > > > >>>>
>> > > > >>>> -Ray
>> > > > >>>>
>> > > > >>>>
>> > > > >>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
>> > > > >>>>
>> > > > >>>>> Yes, good catch. Thank you, James!
>> > > > >>>>>
>> > > > >>>>> Best,
>> > > > >>>>> Stanislav
>> > > > >>>>>
>> > > > >>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <
>> wushujames@gmail.com
>> > >
>> > > > >>>>> wrote:
>> > > > >>>>>
>> > > > >>>>> Can you update the KIP to say what the default is for
>> > > > >>>>>> max.uncleanable.partitions?
>> > > > >>>>>>
>> > > > >>>>>> -James
>> > > > >>>>>>
>> > > > >>>>>> Sent from my iPhone
>> > > > >>>>>>
>> > > > >>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <
>> > > > >>>>>>>
>> > > > >>>>>> stanislav@confluent.io>
>> > > > >>>>
>> > > > >>>>> wrote:
>> > > > >>>>>>
>> > > > >>>>>>> Hey group,
>> > > > >>>>>>>
>> > > > >>>>>>> I am planning on starting a voting thread tomorrow. Please
>> do
>> > > reply
>> > > > >>>>>>> if
>> > > > >>>>>>>
>> > > > >>>>>> you
>> > > > >>>>>>
>> > > > >>>>>>> feel there is anything left to discuss.
>> > > > >>>>>>>
>> > > > >>>>>>> Best,
>> > > > >>>>>>> Stanislav
>> > > > >>>>>>>
>> > > > >>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
>> > > > >>>>>>>
>> > > > >>>>>> stanislav@confluent.io>
>> > > > >>>>>>
>> > > > >>>>>>> wrote:
>> > > > >>>>>>>
>> > > > >>>>>>> Hey, Ray
>> > > > >>>>>>>>
>> > > > >>>>>>>> Thanks for pointing that out, it's fixed now
>> > > > >>>>>>>>
>> > > > >>>>>>>> Best,
>> > > > >>>>>>>> Stanislav
>> > > > >>>>>>>>
>> > > > >>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <
>> > rchiang@apache.org>
>> > > > >>>>>>>>>
>> > > > >>>>>>>> wrote:
>> > > > >>>>
>> > > > >>>>> Thanks.  Can you fix the link in the "KIPs under discussion"
>> > table
>> > > on
>> > > > >>>>>>>>> the main KIP landing page
>> > > > >>>>>>>>> <
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+
>> > > > >>>> Improvement+Proposals#
>> > > > >>>>
>> > > > >>>>> ?
>> > > > >>>>>>>
>> > > > >>>>>>>> I tried, but the Wiki won't let me.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> -Ray
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>> > > > >>>>>>>>>> Hey guys,
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> @Colin - good point. I added some sentences mentioning
>> > recent
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> improvements
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> in the introductory section.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said -
>> > once a
>> > > > >>>>>>>>>> disk
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> fails,
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> you don't want to work with it again. As such, I've
>> changed
>> > my
>> > > > >>>>>>>>>> mind
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> and
>> > > > >>>>>>
>> > > > >>>>>>> believe that we should mark the LogDir (assume its a disk)
>> as
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> offline
>> > > > >>>>
>> > > > >>>>> on
>> > > > >>>>>>
>> > > > >>>>>>> the first `IOException` encountered. This is the
>> LogCleaner's
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> current
>> > > > >>>>
>> > > > >>>>> behavior. We shouldn't change that.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> *Respawning Threads* - I believe we should never
>> re-spawn a
>> > > > >>>>>>>>>> thread.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> The
>> > > > >>>>>>
>> > > > >>>>>>> correct approach in my mind is to either have it stay dead
>> or
>> > > never
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> let
>> > > > >>>>>>
>> > > > >>>>>>> it
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> die in the first place.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right,
>> this
>> > > > metric
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> is
>> > > > >>>>
>> > > > >>>>> unneeded. Users can monitor the `uncleanable-partitions-count`
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> metric
>> > > > >>>>
>> > > > >>>>> and
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> inspect logs.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> Hey Ray,
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> 2) I'm 100% with James in agreement with setting up the
>> > > > LogCleaner
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> to
>> > > > >>>>
>> > > > >>>>> skip over problematic partitions instead of dying.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> I think we can do this for every exception that isn't
>> > > > >>>>>>>>>> `IOException`.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> This
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> will future-proof us against bugs in the system and
>> > potential
>> > > > >>>>>>>>>> other
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> errors.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> Protecting yourself against unexpected failures is
>> always a
>> > > good
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> thing
>> > > > >>>>
>> > > > >>>>> in
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> my mind, but I also think that protecting yourself
>> against
>> > > bugs
>> > > > in
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> the
>> > > > >>>>
>> > > > >>>>> software is sort of clunky. What does everybody think about
>> this?
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> 4) The only improvement I can think of is that if such an
>> > > > >>>>>>>>>>> error occurs, then have the option (configuration
>> setting?)
>> > > to
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> create a
>> > > > >>>>>>
>> > > > >>>>>>> <log_segment>.skip file (or something similar).
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> This is a good suggestion. Have others also seen
>> corruption
>> > be
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> generally
>> > > > >>>>>>
>> > > > >>>>>>> tied to the same segment?
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <
>> > > > >>>>>>>>>> dhruvil@confluent.io
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>> wrote:
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> For the cleaner thread specifically, I do not think
>> > respawning
>> > > > >>>>>>>>>>> will
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> help at
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> all because we are more than likely to run into the same
>> > issue
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> again
>> > > > >>>>
>> > > > >>>>> which
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> would end up crashing the cleaner. Retrying makes sense
>> for
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> transient
>> > > > >>>>
>> > > > >>>>> errors or when you believe some part of the system could have
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> healed
>> > > > >>>>
>> > > > >>>>> itself, both of which I think are not true for the log
>> cleaner.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <
>> > > > >>>>>>>>>>> rndgstn@gmail.com>
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>> wrote:
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> <<<respawning threads is likely to make things worse, by
>> > > putting
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>> you
>> > > > >>>>
>> > > > >>>>> in
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> an
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>>> infinite loop which consumes resources and fires off
>> > > > continuous
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>> log
>> > > > >>>>
>> > > > >>>>> messages.
>> > > > >>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to
>> > mitigate
>> > > > >>>>>>>>>>>> this
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>> effect
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> is
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>>> to implement a backoff mechanism (if a second respawn
>> is
>> > to
>> > > > >>>>>>>>>>>> occur
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>> then
>> > > > >>>>>>
>> > > > >>>>>>> wait
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>>> for 1 minute before doing it; then if a third respawn
>> is
>> > to
>> > > > >>>>>>>>>>>> occur
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>> wait
>> > > > >>>>>>
>> > > > >>>>>>> for
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes,
>> etc.
>> > > up
>> > > > to
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>> some
>> > > > >>>>>>
>> > > > >>>>>>> max
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>> wait time).
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>>> I have no opinion on whether respawn is appropriate or
>> not
>> > > in
>> > > > >>>>>>>>>>>> this
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>> context,
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>>> but a mitigation like the increasing backoff described
>> > above
>> > > > may
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>> be
>> > > > >>>>
>> > > > >>>>> relevant in weighing the pros and cons.
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>>> Ron
>>
>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org> wrote:
>>
>> > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>> > > Hi Stanislav! Thanks for this KIP!
>> > >
>> > > I agree that it would be good if the LogCleaner were more tolerant of
>> > > errors. Currently, as you said, once it dies, it stays dead.
>> > >
>> > > Things are better now than they used to be. We have the metric
>> > > kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>> > > which we can use to tell us if the threads are dead. And as of 1.1.0,
>> > > we have KIP-226, which allows you to restart the log cleaner thread,
>> > > without requiring a broker restart.
>> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration>
>> > > I've only read about this, I haven't personally tried it.
>> >
>> > Thanks for pointing this out, James!  Stanislav, we should probably add
>> > a sentence or two mentioning the KIP-226 changes somewhere in the KIP.
>> > Maybe in the intro section?
>> >
>> > I think it's clear that requiring the users to manually restart the log
>> > cleaner is not a very good solution.  But it's good to know that it's a
>> > possibility on some older releases.
>> > > Some comments:
>> > > * I like the idea of having the log cleaner continue to clean as many
>> > > partitions as it can, skipping over the problematic ones if possible.
>> > > * If the log cleaner thread dies, I think it should automatically be
>> > > revived. Your KIP attempts to do that by catching exceptions during
>> > > execution, but I think we should go all the way and make sure that a
>> > > new one gets created, if the thread ever dies.
>> >
>> > This is inconsistent with the way the rest of Kafka works.  We don't
>> > automatically re-create other threads in the broker if they terminate.
>> > In general, if there is a serious bug in the code, respawning threads
>> > is likely to make things worse, by putting you in an infinite loop
>> > which consumes resources and fires off continuous log messages.
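James's first point — clean what you can and skip what fails — amounts to a per-partition try/catch instead of letting one failure kill the whole thread. A rough Python sketch (illustrative only; the real LogCleaner is Scala and its structure differs):

```python
def clean_pass(partitions, clean_one, uncleanable):
    """One cleaning pass: attempt every partition, quarantine failures
    in `uncleanable`, and keep going instead of letting the thread die.

    partitions:  iterable of topic-partition ids
    clean_one:   callable that cleans a single partition, may raise
    uncleanable: set accumulating partitions that failed to clean
    """
    cleaned = []
    for tp in partitions:
        if tp in uncleanable:
            continue                 # already marked bad; don't retry
        try:
            clean_one(tp)
            cleaned.append(tp)
        except Exception:
            uncleanable.add(tp)      # quarantine just this partition
    return cleaned
```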
>> > > * It might be worth trying to re-clean the uncleanable partitions.
>> > > I've seen cases where an uncleanable partition later became cleanable.
>> > > I unfortunately don't remember how that happened, but I remember being
>> > > surprised when I discovered it. It might have been something like a
>> > > follower was uncleanable but after a leader election happened, the log
>> > > truncated and it was then cleanable again. I'm not sure.
>> >
>> > James, I disagree.  We had this behavior in the Hadoop Distributed File
>> > System (HDFS) and it was a constant source of user problems.
>> >
>> > What would happen is disks would just go bad over time.  The DataNode
>> > would notice this and take them offline.  But then, due to some
>> > "optimistic" code, the DataNode would periodically try to re-add them
>> > to the system.  Then one of two things would happen: the disk would
>> > just fail immediately again, or it would appear to work and then fail
>> > after a short amount of time.
>> >
>> > The way the disk failed was normally having an I/O request take a
>> > really long time and time out.  So a bunch of request handler threads
>> > would basically slam into a brick wall when they tried to access the
>> > bad disk, slowing the DataNode to a crawl.  It was even worse in the
>> > second scenario, if the disk appeared to work for a while, but then
>> > failed.  Any data that had been written on that DataNode to that disk
>> > would be lost, and we would need to re-replicate it.
>> >
>> > Disks aren't biological systems -- they don't heal over time.  Once
>> > they're bad, they stay bad.  The log cleaner needs to be robust against
>> > cases where the disk really is failing, and really is returning bad
>> > data or timing out.
>> > > * For your metrics, can you spell out the full metric in JMX-style
>> > > format, such as:
>> > >
>> > >   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>> > >                 value=4
>> > >
>> > > * For "uncleanable-partitions": topic-partition names can be very
>> > > long. I think the current max size is 210 characters (or maybe
>> > > 240-ish?). Having the "uncleanable-partitions" being a list could be
>> > > a very large metric. Also, having the metric come out as a csv might
>> > > be difficult to work with for monitoring systems. If we *did* want
>> > > the topic names to be accessible, what do you think of having the
>> > >         kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>> > > I'm not sure if LogCleanerManager is the right type, but my example
>> > > was that the topic and partition can be tags in the metric. That will
>> > > allow monitoring systems to more easily slice and dice the metric.
>> > > I'm not sure what the attribute for that metric would be. Maybe
>> > > something like "uncleaned bytes" for that topic-partition? Or
>> > > time-since-last-clean? Or maybe even just "Value=1".
>> >
>> > I haven't thought about this that hard, but do we really need the
>> > uncleanable topic names to be accessible through a metric?  It seems
>> > like the admin should notice that uncleanable partitions are present,
>> > and then check the logs?
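James's tag-based layout can be illustrated with a toy registry keyed by metric name plus a tag set (a hypothetical structure for the sake of the example — the real broker registers Yammer/JMX metrics, and no such Python API exists in Kafka):

```python
registry = {}

def record_uncleanable(topic, partition, uncleaned_bytes):
    """One gauge per partition, with topic/partition as tags, instead of
    a single CSV attribute that can blow past JMX value-size limits."""
    key = ("kafka.log:type=LogCleanerManager,name=uncleanable-partition",
           frozenset({("topic", topic), ("partition", str(partition))}))
    registry[key] = uncleaned_bytes

record_uncleanable("topic1", 2, 1024)
record_uncleanable("topic1", 3, 2048)
```

A monitoring system can then filter or aggregate by tag rather than parsing one long comma-separated string.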
>> > > * About `max.uncleanable.partitions`, you said that this likely
>> > > indicates that the disk is having problems. I'm not sure that is the
>> > > case. For the 4 JIRAs that you mentioned about log cleaner problems,
>> > > all of them are partition-level scenarios that happened during normal
>> > > operation. None of them were indicative of disk problems.
>> >
>> > I don't think this is a meaningful comparison.  In general, we don't
>> > accept JIRAs for hard disk problems that happen on a particular
>> > cluster.  If someone opened a JIRA that said "my hard disk is having
>> > problems" we could close that as "not a Kafka bug."  This doesn't prove
>> > that disk problems don't happen, but just that JIRA isn't the right
>> > place for them.
>> >
>> > I do agree that the log cleaner has had a significant number of logic
>> > bugs, and that we need to be careful to limit their impact.  That's one
>> > reason why I think that a threshold of "number of uncleanable logs" is
>> > a good idea, rather than just failing after one IOException.  In all
>> > the cases I've seen where a user hit a logic bug in the log cleaner, it
>> > was just one partition that had the issue.  We also should increase
>> > test coverage for the log cleaner.
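The threshold Colin describes — tolerate a handful of uncleanable logs per log directory, and fail the directory only past a limit — could look roughly like this (sketch only; `state` and the helper name are invented, and 10 matches the default the thread later confirms for `max.uncleanable.partitions`):

```python
def note_uncleanable(state, log_dir, tp, max_uncleanable=10):
    """Record a failed partition against its log dir; return True once
    the dir has crossed max.uncleanable.partitions and should be failed.

    state: dict mapping log_dir -> set of uncleanable partitions
    """
    failed = state.setdefault(log_dir, set())
    failed.add(tp)
    return len(failed) > max_uncleanable
```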
>> > > * About marking disks as offline when exceeding a certain threshold,
>> > > that actually increases the blast radius of log compaction failures.
>> > > Currently, the uncleaned partitions are still readable and writable.
>> > > Taking the disks offline would impact availability of the uncleanable
>> > > partitions, as well as impact all other partitions that are on the
>> > > disk.
>> >
>> > In general, when we encounter I/O errors, we take the disk partition
>> > offline.  This is spelled out in KIP-112 (
>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>> > ) :
>> >
>> > > - Broker assumes a log directory to be good after it starts, and mark
>> > > log directory as bad once there is IOException when broker attempts
>> > > to access (i.e. read or write) the log directory.
>> > > - Broker will be offline if all log directories are bad.
>> > > - Broker will stop serving replicas in any bad log directory. New
>> > >
>> > >
>> > > --
>> > > Best,
>> > > Stanislav
>> > >
>> >
>>
>>
>> --
>> Best,
>> Stanislav
>>
>
>

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Stanislav, responses below:

My initial suggestion was to *track *the uncleanable disk space.
> I can see why marking a log directory as offline after a certain threshold
> of uncleanable disk space is more useful.
> I'm not sure if we can set that threshold to be of certain size (e.g 100GB)
> as log directories might have different sizes.  Maybe a percentage would be
> better then (e.g 30% of whole log dir size), WDYT?





On Fri, Aug 10, 2018 at 2:05 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:

> Hey Jason,
>
> My initial suggestion was to *track* the uncleanable disk space.
> I can see why marking a log directory as offline after a certain threshold
> of uncleanable disk space is more useful.
> I'm not sure if we can set that threshold to be of a certain size (e.g. 100GB)
> as log directories might have different sizes.  Maybe a percentage would be
> better then (e.g. 30% of the whole log dir size), WDYT?
>
> I feel it still makes sense to have a metric tracking how many uncleanable
> partitions there are and the total amount of uncleanable disk space (per
> log dir, via a JMX tag).
> But now, rather than fail the log directory after a certain count of
> uncleanable partitions, we could fail it after a certain percentage (or
> size) of its storage is uncleanable.
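The percentage variant could be as small as the following check (a sketch only; 30% mirrors the example above and is not a proposed default):

```python
def should_fail_log_dir(uncleanable_bytes, dir_capacity_bytes,
                        max_uncleanable_ratio=0.30):
    """Fail a log dir once uncleanable bytes exceed a fraction of its
    capacity, so differently sized dirs get a comparable threshold."""
    return uncleanable_bytes > max_uncleanable_ratio * dir_capacity_bytes
```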
>
> I'd like to hear other people's thoughts on this. Sound good?
>
> Best,
> Stanislav
>
>
>
>
> On Fri, Aug 10, 2018 at 12:40 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Stanislav,
> >
> > Sorry, I was probably looking at an older version (I had the tab open for
> > so long!).
> >
> > I have been thinking about `max.uncleanable.partitions` and wondering if
> > it's what we really want. The main risk if the cleaner cannot clean a
> > partition is eventually running out of disk space. This is the most
> > common problem we have seen with cleaner failures and it can happen even
> > if there is just one uncleanable partition. We've actually seen cases in
> > which a single __consumer_offsets partition grew large enough to fill a
> > significant portion of the disk. The difficulty with allowing a system
> > to run out of disk space before failing is that it makes recovery
> > difficult and time consuming. Clean shutdown, for example, requires
> > writing some state to disk. Without clean shutdown, it can take the
> > broker significantly longer to start up because it has to do more
> > segment recovery.
> >
> > For this problem, `max.uncleanable.partitions` does not really help. You
> > can set it to 1 and fail fast, but that is not much better than the
> > existing state. You had a suggestion previously in the KIP to use the
> > size of uncleanable disk space instead. What was the reason for
> > rejecting that? Intuitively, it seems like a better fit for a cleaner
> > failure. It would provide users some time to react to failures while
> > still protecting them from exhausting the disk.
> >
> > Thanks,
> > Jason
> >
> >
> >
> >
> > On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:
> >
> > > Hey Jason,
> > >
> > > 1. *10* is the default value, it says so in the KIP
> > > 2. This is a good catch. As the current implementation stands, it's not a
> > > useful metric since the thread continues to run even if all log
> > > directories are offline (although I'm not sure what the broker's behavior
> > > is in that scenario). I'll make sure the thread stops if all log
> > > directories are offline.
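The intended behavior — the cleaner thread ending once no online log directory remains — is roughly this loop (illustrative; `online_dirs` and `clean_once` are placeholders, not broker APIs):

```python
def cleaner_loop(online_dirs, clean_once):
    """Keep running cleaning passes while at least one log dir is online;
    exit (instead of spinning uselessly) once every dir is offline."""
    passes = 0
    while online_dirs():        # callable returning the currently online dirs
        clean_once()
        passes += 1
    return passes
```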
> > >
> > > I don't know which "Needs Discussion" item you're referencing, there
> > > hasn't been any in the KIP since August 1 and that was for the metric
> > > only. KIP History
> > > <https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=89064875>
> > >
> > > I've updated the KIP to mention the "time-since-last-run" metric.
> > >
> > > Thanks,
> > > Stanislav
> > >
> > > On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Stanislav,
> > > >
> > > > Just a couple quick questions:
> > > >
> > > > 1. I may have missed it, but what will be the default value for
> > > > `max.uncleanable.partitions`?
> > > > 2. It seems there will be some impact for users that monitor
> > > > "time-since-last-run-ms" in order to detect cleaner failures. Not sure
> > > > it's a major concern, but probably worth mentioning in the compatibility
> > > > section. Also, is this still a useful metric after this KIP?
> > > >
> > > > Also, maybe the "Needs Discussion" item can be moved to rejected
> > > > alternatives since you've moved to a vote? I think leaving this for
> > > > potential future work is reasonable.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > >
> > > > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rc...@apache.org>
> > wrote:
> > > >
> > > > > I'm okay with that.
> > > > >
> > > > > -Ray
> > > > >
> > > > > On 8/6/18 10:59 AM, Colin McCabe wrote:
> > > > >
> > > > >> Perhaps we could start with max.uncleanable.partitions and then
> > > > implement
> > > > >> max.uncleanable.partitions.per.logdir in a follow-up change if it
> > > seemed
> > > > >> to be necessary?  What do you think?
> > > > >>
> > > > >> regards,
> > > > >> Colin
> > > > >>
> > > > >>
> > > > >> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> > > > >>
> > > > >>> Hey Ray,
> > > > >>>
> > > > >>> Thanks for the explanation. In regards to the configuration property -
> > > > >>> I'm not sure. As long as it has sufficient documentation, I find
> > > > >>> "max.uncleanable.partitions" to be okay. If we were to add the
> > > > >>> distinction explicitly, maybe it should be
> > > > >>> `max.uncleanable.partitions.per.logdir`?
> > > > >>>
> > > > >>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org>
> > > wrote:
> > > > >>>
> > > > >>> One more thing occurred to me.  Should the configuration property
> > be
> > > > >>>> named "max.uncleanable.partitions.per.disk" instead?
> > > > >>>>
> > > > >>>> -Ray
> > > > >>>>
> > > > >>>>
> > > > >>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > > > >>>>
> > > > >>>>> Yes, good catch. Thank you, James!
> > > > >>>>>
> > > > >>>>> Best,
> > > > >>>>> Stanislav
> > > > >>>>>
> > > > >>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushujames@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>> Can you update the KIP to say what the default is for
> > > > >>>>>> max.uncleanable.partitions?
> > > > >>>>>>
> > > > >>>>>> -James
> > > > >>>>>>
> > > > >>>>>> Sent from my iPhone
> > > > >>>>>>
> > > > >>>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hey group,
> > > > >>>>>>>
> > > > >>>>>>> I am planning on starting a voting thread tomorrow. Please do
> > > > >>>>>>> reply if you feel there is anything left to discuss.
> > > > >>>>>>>
> > > > >>>>>>> Best,
> > > > >>>>>>> Stanislav
> > > > >>>>>>>
> > > > >>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanislav@confluent.io> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hey, Ray
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks for pointing that out, it's fixed now
> > > > >>>>>>>>
> > > > >>>>>>>> Best,
> > > > >>>>>>>> Stanislav
> > > > >>>>>>>>
> > > > >>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table
> > > > >>>>>>>>> on the main KIP landing page
> > > > >>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> > > > >>>>>>>>>
> > > > >>>>>>>>> I tried, but the Wiki won't let me.
> > > > >>>>>>>>>
> > > > >>>>>>>>> -Ray
> > > > >>>>>>>>>
> > > > >>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > > > >>>>>>>>>> Hey guys,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> @Colin - good point. I added some sentences mentioning recent
> > > > >>>>>>>>>> improvements in the introductory section.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a
> > > > >>>>>>>>>> disk fails, you don't want to work with it again. As such, I've
> > > > >>>>>>>>>> changed my mind and believe that we should mark the LogDir
> > > > >>>>>>>>>> (assume it's a disk) as offline on the first `IOException`
> > > > >>>>>>>>>> encountered. This is the LogCleaner's current behavior. We
> > > > >>>>>>>>>> shouldn't change that.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a
> > > > >>>>>>>>>> thread. The correct approach in my mind is to either have it
> > > > >>>>>>>>>> stay dead or never let it die in the first place.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this
> > > > >>>>>>>>>> metric is unneeded. Users can monitor the
> > > > >>>>>>>>>> `uncleanable-partitions-count` metric and inspect logs.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Hey Ray,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the
> > > > >>>>>>>>>>> LogCleaner to skip over problematic partitions instead of
> > > > >>>>>>>>>>> dying.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I think we can do this for every exception that isn't
> > > > >>>>>>>>>> `IOException`. This will future-proof us against bugs in the
> > > > >>>>>>>>>> system and potential other errors. Protecting yourself against
> > > > >>>>>>>>>> unexpected failures is always a good thing in my mind, but I
> > > > >>>>>>>>>> also think that protecting yourself against bugs in the
> > > > >>>>>>>>>> software is sort of clunky. What does everybody think about
> > > > >>>>>>>>>> this?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> 4) The only improvement I can think of is that if such an
> > > > >>>>>>>>>>> error occurs, then have the option (configuration setting?)
> > > > >>>>>>>>>>> to create a <log_segment>.skip file (or something similar).
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> This is a good suggestion. Have others also seen corruption be
> > > > >>>>>>>>>> generally tied to the same segment?
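Ray's `<log_segment>.skip` idea is essentially a persistent quarantine marker next to the segment file — for example (a sketch; no such mechanism exists in Kafka today, and the file naming is invented):

```python
import os

def mark_segment_skipped(segment_path):
    """Drop a marker file next to a corrupt segment so later cleaner
    runs can skip it without re-hitting the same corruption."""
    open(segment_path + ".skip", "w").close()

def segment_skipped(segment_path):
    """True if this segment has previously been marked uncleanable."""
    return os.path.exists(segment_path + ".skip")
```

Because the marker lives on disk, the skip decision would survive broker restarts, unlike an in-memory uncleanable set.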
> > > > >>>>>>>>>>
> > > > >>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> For the cleaner thread specifically, I do not think respawning
> > > > >>>>>>>>>> will help at all because we are more than likely to run into
> > > > >>>>>>>>>> the same issue again which would end up crashing the cleaner.
> > > > >>>>>>>>>> Retrying makes sense for transient errors or when you believe
> > > > >>>>>>>>>> some part of the system could have healed itself, both of
> > > > >>>>>>>>>> which I think are not true for the log cleaner.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com>
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> <<<respawning threads is likely to make things worse, by
> > > > >>>>>>>>>>>> putting you in an infinite loop which consumes resources and
> > > > >>>>>>>>>>>> fires off continuous log messages.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate
> > > > >>>>>>>>>>>> this effect is to implement a backoff mechanism (if a second
> > > > >>>>>>>>>>>> respawn is to occur then wait for 1 minute before doing it;
> > > > >>>>>>>>>>>> then if a third respawn is to occur wait for 2 minutes before
> > > > >>>>>>>>>>>> doing it; then 4 minutes, 8 minutes, etc. up to some max wait
> > > > >>>>>>>>>>>> time).
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not in
> > > > >>>>>>>>>>>> this context, but a mitigation like the increasing backoff
> > > > >>>>>>>>>>>> described above may be relevant in weighing the pros and cons.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Ron
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org>
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > > > >>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more
> > > > >>>>>>>>>>>>>> tolerant of errors. Currently, as you said, once it dies,
> > > > >>>>>>>>>>>>>> it stays dead.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Things are better now than they used to be. We have the
> > > > >>>>>>>>>>>>>> metric
> > > > >>>>>>>>>>>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > > > >>>>>>>>>>>>>> which we can use to tell us if the threads are dead. And
> > > > >>>>>>>>>>>>>> as of 1.1.0, we have KIP-226, which allows you to restart
> > > > >>>>>>>>>>>>>> the log cleaner thread, without requiring a broker restart.
> > > > >>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration>
> > > > >>>>>>>>>>>>>> I've only read about this, I haven't personally tried it.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
> > > > >>>>>>>>>>>>> probably add a sentence or two mentioning the KIP-226
> > > > >>>>>>>>>>>>> changes somewhere in the KIP.  Maybe in the intro section?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I think it's clear that requiring the users to manually
> > > > >>>>>>>>>>>>> restart the log cleaner is not a very good solution.  But
> > > > >>>>>>>>>>>>> it's good to know that it's a possibility on some older
> > > > >>>>>>>>>>>>> releases.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Some comments:
> > > > >>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to
> > > > >>>>>>>>>>>>>> clean as many partitions as it can, skipping over the
> > > > >>>>>>>>>>>>>> problematic ones if possible.
> > > > >>>>>>>>>>>>>> * If the log cleaner thread dies, I think it should
> > > > >>>>>>>>>>>>>> automatically be revived. Your KIP attempts to do that by
> > > > >>>>>>>>>>>>>> catching exceptions during execution, but I think we
> > > > >>>>>>>>>>>>>> should go all the way and make sure that a new one gets
> > > > >>>>>>>>>>>>>> created, if the thread ever dies.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.
> > > > >>>>>>>>>>>>> We don't automatically re-create other threads in the broker if they
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> terminate.
> > > > >>>>>>>>>
> > > > >>>>>>>>>> In
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> general, if there is a serious bug in the code,
> > respawning
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> threads
> > > > >>>>
> > > > >>>>> is
> > > > >>>>>>
> > > > >>>>>>> likely to make things worse, by putting you in an infinite
> loop
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> which
> > > > >>>>>>
> > > > >>>>>>> consumes resources and fires off continuous log messages.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> partitions.
> > > > >>>>
> > > > >>>>> I've
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> seen cases where an uncleanable partition later became
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> cleanable.
> > > > >>>>
> > > > >>>>> I
> > > > >>>>>>
> > > > >>>>>>> unfortunately don't remember how that happened, but I
> remember
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> being
> > > > >>>>>>
> > > > >>>>>>> surprised when I discovered it. It might have been something
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> like
> > > > >>>>
> > > > >>>>> a
> > > > >>>>>>
> > > > >>>>>>> follower was uncleanable but after a leader election
> happened,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> the
> > > > >>>>
> > > > >>>>> log
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> truncated and it was then cleanable again. I'm not sure.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> Distributed
> > > > >>>>
> > > > >>>>> File
> > > > >>>>>>>>>
> > > > >>>>>>>>>> System (HDFS) and it was a constant source of user
> problems.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> What would happen is disks would just go bad over time.
> > > The
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> DataNode
> > > > >>>>>>
> > > > >>>>>>> would notice this and take them offline.  But then, due to
> some
> > > > >>>>>>>>>>>>> "optimistic" code, the DataNode would periodically try
> to
> > > > >>>>>>>>>>>>> re-add
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> them
> > > > >>>>>>
> > > > >>>>>>> to
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> the system.  Then one of two things would happen: the
> disk
> > > > would
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> just
> > > > >>>>>>
> > > > >>>>>>> fail
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> immediately again, or it would appear to work and then
> > fail
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> after a
> > > > >>>>
> > > > >>>>> short
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> amount of time.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> The way the disk failed was normally having an I/O
> > request
> > > > >>>>>>>>>>>>> take a
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> really
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> long time and time out.  So a bunch of request handler
> > > threads
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> would
> > > > >>>>>>
> > > > >>>>>>> basically slam into a brick wall when they tried to access
> the
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> bad
> > > > >>>>
> > > > >>>>> disk,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> slowing the DataNode to a crawl.  It was even worse in
> the
> > > > >>>>>>>>>>>>> second
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> scenario,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> if the disk appeared to work for a while, but then
> > failed.
> > > > Any
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> data
> > > > >>>>>>
> > > > >>>>>>> that
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> had been written on that DataNode to that disk would be
> > > lost,
> > > > >>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> we
> > > > >>>>>>
> > > > >>>>>>> would
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> need to re-replicate it.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Disks aren't biological systems-- they don't heal over
> > > time.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> Once
> > > > >>>>
> > > > >>>>> they're
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> bad, they stay bad.  The log cleaner needs to be robust
> > > > against
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> cases
> > > > >>>>>>
> > > > >>>>>>> where
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> the disk really is failing, and really is returning bad
> > > data
> > > > or
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> timing
> > > > >>>>>>>>>
> > > > >>>>>>>>>> out.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> * For your metrics, can you spell out the full metric
> in
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JMX-style
> > > > >>>>
> > > > >>>>> format, such as:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-
> > > > >>>> partitions-count
> > > > >>>>
> > > > >>>>>                 value=4
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names
> > can
> > > be
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> very
> > > > >>>>
> > > > >>>>> long.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> I think the current max size is 210 characters (or maybe
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> 240-ish?).
> > > > >>>>>>
> > > > >>>>>>> Having the "uncleanable-partitions" being a list could be
> very
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> large
> > > > >>>>>>
> > > > >>>>>>> metric. Also, having the metric come out as a csv might be
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> difficult
> > > > >>>>>>
> > > > >>>>>>> to
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> work with for monitoring systems. If we *did* want the
> > topic
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> names
> > > > >>>>
> > > > >>>>> to
> > > > >>>>>>>>>
> > > > >>>>>>>>>> be
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> accessible, what do you think of having the
> > > > >>>>>>>>>>>>>>         kafka.log:type=LogCleanerManag
> > > > >>>>>>>>>>>>>> er,topic=topic1,partition=2
> > > > >>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type,
> but
> > > my
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> example
> > > > >>>>>>
> > > > >>>>>>> was
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> that the topic and partition can be tags in the metric.
> > That
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> will
> > > > >>>>
> > > > >>>>> allow
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> monitoring systems to more easily slice and dice the
> > metric.
> > > > I'm
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> not
> > > > >>>>>>
> > > > >>>>>>> sure what the attribute for that metric would be. Maybe
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> something
> > > > >>>>
> > > > >>>>> like
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> time-since-last-clean?
> > > > >>>>>>>>>
> > > > >>>>>>>>>> Or
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> maybe even just "Value=1".
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I haven't though about this that hard, but do we really
> > > need
> > > > >>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>> uncleanable topic names to be accessible through a
> > metric?
> > > > It
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> seems
> > > > >>>>>>
> > > > >>>>>>> like
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> the admin should notice that uncleanable partitions are
> > > > present,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> and
> > > > >>>>>>
> > > > >>>>>>> then
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> check the logs?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that
> this
> > > > likely
> > > > >>>>>>>>>>>>>> indicates that the disk is having problems. I'm not
> sure
> > > > that
> > > > >>>>>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> the
> > > > >>>>>>
> > > > >>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> problems,
> > > > >>>>>>
> > > > >>>>>>> all
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> of them are partition-level scenarios that happened
> > during
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> normal
> > > > >>>>
> > > > >>>>> operation. None of them were indicative of disk problems.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I don't think this is a meaningful comparison.  In
> > general,
> > > > we
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> don't
> > > > >>>>>>
> > > > >>>>>>> accept JIRAs for hard disk problems that happen on a
> particular
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> cluster.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> If someone opened a JIRA that said "my hard disk is
> having
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> problems"
> > > > >>>>>>
> > > > >>>>>>> we
> > > > >>>>>>>>>
> > > > >>>>>>>>>> could close that as "not a Kafka bug."  This doesn't prove
> > > that
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> disk
> > > > >>>>>>
> > > > >>>>>>> problems don't happen, but  just that JIRA isn't the right
> > place
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> for
> > > > >>>>>>
> > > > >>>>>>> them.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I do agree that the log cleaner has had a significant
> > > number
> > > > of
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> logic
> > > > >>>>>>
> > > > >>>>>>> bugs, and that we need to be careful to limit their impact.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> That's
> > > > >>>>
> > > > >>>>> one
> > > > >>>>>>>>>
> > > > >>>>>>>>>> reason why I think that a threshold of "number of
> > uncleanable
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> logs"
> > > > >>>>
> > > > >>>>> is
> > > > >>>>>>>>>
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> good idea, rather than just failing after one
> IOException.
> > > In
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> all
> > > > >>>>
> > > > >>>>> the
> > > > >>>>>>>>>
> > > > >>>>>>>>>> cases I've seen where a user hit a logic bug in the log
> > > cleaner,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> it
> > > > >>>>
> > > > >>>>> was
> > > > >>>>>>>>>
> > > > >>>>>>>>>> just one partition that had the issue.  We also should
> > > increase
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> test
> > > > >>>>>>
> > > > >>>>>>> coverage for the log cleaner.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> * About marking disks as offline when exceeding a
> certain
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> threshold,
> > > > >>>>>>
> > > > >>>>>>> that actually increases the blast radius of log compaction
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> failures.
> > > > >>>>>>
> > > > >>>>>>> Currently, the uncleaned partitions are still readable and
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> writable.
> > > > >>>>>>
> > > > >>>>>>> Taking the disks offline would impact availability of the
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> uncleanable
> > > > >>>>>>>>>
> > > > >>>>>>>>>> partitions, as well as impact all other partitions that
> are
> > on
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> the
> > > > >>>>
> > > > >>>>> disk.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> In general, when we encounter I/O errors, we take the
> > disk
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>> partition
> > > > >>>>>>
> > > > >>>>>>> offline.  This is spelled out in KIP-112 (
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%
> > > > >>>> 3A+Handle+disk+failure+for+JBOD
> > > > >>>>
> > > > >>>>> ) :
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> - Broker assumes a log directory to be good after it
> > > starts,
> > > > >>>>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> mark
> > > > >>>>>>>>>
> > > > >>>>>>>>>> log directory as
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> bad once there is IOException when broker attempts to
> > > access
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> (i.e.
> > > > >>>>
> > > > >>>>> read
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> or write) the log directory.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> - Broker will be offline if all log directories are
> bad.
> > > > >>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log
> > > > directory.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> New
> > > > >>>
> > > > >>>
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> > >
> >
>
>
> --
> Best,
> Stanislav
>
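The JMX naming trade-off debated above, a single aggregate `uncleanable-partitions-count` gauge versus James's per-partition tagged names, can be illustrated with standard `javax.management.ObjectName`s. This is a sketch only: the tagged form is James's suggestion from the thread, not the shape the KIP settled on.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class MetricNameExamples {
    /** Parses a JMX ObjectName and returns one of its key properties. */
    static String keyProperty(String name, String key) {
        try {
            return new ObjectName(name).getKeyProperty(key);
        } catch (MalformedObjectNameException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Aggregate count metric, as proposed in the KIP.
        String count = "kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count";

        // Tagged per-partition variant, as suggested by James: topic and
        // partition become key properties that monitoring systems can
        // slice and dice, instead of one long csv-valued attribute.
        String tagged = "kafka.log:type=LogCleanerManager,topic=topic1,partition=2";

        System.out.println(keyProperty(count, "name"));       // uncleanable-partitions-count
        System.out.println(keyProperty(tagged, "topic"));     // topic1
        System.out.println(keyProperty(tagged, "partition")); // 2
    }
}
```

Because the topic and partition live in key properties rather than in the attribute value, a monitoring system can match all of them with a pattern like `kafka.log:type=LogCleanerManager,topic=*,partition=*`.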

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey Jason,

My initial suggestion was to *track* the uncleanable disk space.
I can see why marking a log directory as offline after a certain threshold
of uncleanable disk space is more useful.
I'm not sure we can set that threshold to a fixed size (e.g. 100GB), as log
directories might have different sizes. Maybe a percentage would be better
in that case (e.g. 30% of the whole log dir size), WDYT?

I feel it still makes sense to have a metric tracking how many uncleanable
partitions there are and the total amount of uncleanable disk space (per
log dir, via a JMX tag).
But now, rather than fail the log directory after a certain count of
uncleanable partitions, we could fail it after a certain percentage (or
size) of its storage is uncleanable.

I'd like to hear other people's thoughts on this. Sound good?

Best,
Stanislav
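The percentage-based check floated above can be sketched as a small predicate. This is purely illustrative, with hypothetical names; the real decision would live inside the cleaner manager and would need to aggregate sizes per log directory.

```java
// Illustrative only: decide whether a log dir should be failed based on
// what fraction of its used bytes belongs to uncleanable partitions.
public class UncleanableSpaceCheck {
    /**
     * @param uncleanableBytes    total bytes held by uncleanable partitions in the dir
     * @param totalLogDirBytes    total bytes used by the log dir
     * @param maxUncleanableRatio e.g. 0.30 to fail the dir at 30% uncleanable
     */
    public static boolean shouldFailLogDir(long uncleanableBytes,
                                           long totalLogDirBytes,
                                           double maxUncleanableRatio) {
        // An empty (or not-yet-measured) log dir is never failed.
        if (totalLogDirBytes <= 0) return false;
        return (double) uncleanableBytes / totalLogDirBytes >= maxUncleanableRatio;
    }
}
```

A ratio sidesteps the problem that log directories have different capacities: the same 30% setting is meaningful on a 100GB disk and on a 10TB disk, whereas a fixed 100GB threshold is not.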




On Fri, Aug 10, 2018 at 12:40 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Stanislav,
>
> Sorry, I was probably looking at an older version (I had the tab open for
> so long!).
>
> I have been thinking about `max.uncleanable.partitions` and wondering if
> it's what we really want. The main risk if the cleaner cannot clean a
> partition is eventually running out of disk space. This is the most common
> problem we have seen with cleaner failures and it can happen even if there
> is just one uncleanable partition. We've actually seen cases in which a
> single __consumer_offsets grew large enough to fill a significant portion
> of the disk. The difficulty with allowing a system to run out of disk space
> before failing is that it makes recovery difficult and time consuming.
> Clean shutdown, for example, requires writing some state to disk. Without
> clean shutdown, it can take the broker significantly longer to start up
> because it has to do more segment recovery.
>
> For this problem, `max.uncleanable.partitions` does not really help. You
> can set it to 1 and fail fast, but that is not much better than the
> existing state. You had a suggestion previously in the KIP to use the size
> of uncleanable disk space instead. What was the reason for rejecting that?
> Intuitively, it seems like a better fit for a cleaner failure. It would
> provide users some time to react to failures while still protecting them
> from exhausting the disk.
>
> Thanks,
> Jason
>
>
>
>
> On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <
> stanislav@confluent.io>
> wrote:
>
> > Hey Jason,
> >
> > 1. *10* is the default value; it says so in the KIP.
> > 2. This is a good catch. As the current implementation stands, it's not a
> > useful metric since the thread continues to run even if all log directories
> > are offline (although I'm not sure what the broker's behavior is in that
> > scenario). I'll make sure the thread stops if all log directories are
> > offline.
> >
> > I don't know which "Needs Discussion" item you're referencing; there hasn't
> > been any in the KIP since August 1, and that was for the metric only. KIP
> > History
> > <https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=89064875>
> >
> > I've updated the KIP to mention the "time-since-last-run" metric.
> >
> > Thanks,
> > Stanislav
> >
> > On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Stanislav,
> > >
> > > Just a couple quick questions:
> > >
> > > 1. I may have missed it, but what will be the default value for
> > > `max.uncleanable.partitions`?
> > > 2. It seems there will be some impact for users monitoring
> > > "time-since-last-run-ms" in order to detect cleaner failures. Not sure it's
> > > a major concern, but probably worth mentioning in the compatibility
> > > section. Also, is this still a useful metric after this KIP?
> > >
> > > Also, maybe the "Needs Discussion" item can be moved to rejected
> > > alternatives since you've moved to a vote? I think leaving this for
> > > potential future work is reasonable.
> > >
> > > Thanks,
> > > Jason
> > >
> > >
> > > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rc...@apache.org> wrote:
> > > > I'm okay with that.
> > > >
> > > > -Ray
> > > >
> > > > On 8/6/18 10:59 AM, Colin McCabe wrote:
> > > > > Perhaps we could start with max.uncleanable.partitions and then
> > > > > implement max.uncleanable.partitions.per.logdir in a follow-up
> > > > > change if it seemed to be necessary?  What do you think?
> > > > >
> > > > > regards,
> > > > > Colin
> > > > >
> > > > > On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> > > > > > Hey Ray,
> > > > > >
> > > > > > Thanks for the explanation. In regards to the configuration
> > > > > > property - I'm not sure. As long as it has sufficient
> > > > > > documentation, I find "max.uncleanable.partitions" to be okay.
> > > > > > If we were to add the distinction explicitly, maybe it should
> > > > > > be `max.uncleanable.partitions.per.logdir`?
> > > > > >
> > > > > > On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org> wrote:
> > > > > > > One more thing occurred to me.  Should the configuration
> > > > > > > property be named "max.uncleanable.partitions.per.disk"
> > > > > > > instead?
> > > > > > >
> > > > > > > -Ray
> > > > > > >
> > > > > > > On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > > > > > > > Yes, good catch. Thank you, James!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Stanislav
> > > > > > > >
> > > > > > > > On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushujames@gmail.com> wrote:
> > > > > > > > > Can you update the KIP to say what the default is for
> > > > > > > > > max.uncleanable.partitions?
> > > > > > > > >
> > > > > > > > > -James
> > > > > > > > >
> > > > > > > > > Sent from my iPhone
> > > > > > > > >
> > > > > > > > > On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:
> > > > > > > > > > Hey group,
> > > > > > > > > >
> > > > > > > > > > I am planning on starting a voting thread tomorrow.
> > > > > > > > > > Please do reply if you feel there is anything left to
> > > > > > > > > > discuss.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Stanislav
> > > > > > > > > >
> > > > > > > > > > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanislav@confluent.io> wrote:
> > > > > > > > > > > Hey, Ray
> > > > > > > > > > >
> > > > > > > > > > > Thanks for pointing that out, it's fixed now
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Stanislav
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org> wrote:
> > > > > > > > > > > > Thanks.  Can you fix the link in the "KIPs under
> > > > > > > > > > > > discussion" table on the main KIP landing page
> > > > > > > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> > > > > > > > > > > > I tried, but the Wiki won't let me.
> > > > > > > > > > > >
> > > > > > > > > > > > -Ray
> > > > > > > > > > > >
> > > > > > > > > > > > On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > > > > > > > > > > > > Hey guys,
> > > > > > > > > > > > >
> > > > > > > > > > > > > @Colin - good point. I added some sentences
> > > > > > > > > > > > > mentioning recent improvements in the introductory
> > > > > > > > > > > > > section.
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Disk Failure* - I tend to agree with what Colin
> > > > > > > > > > > > > said - once a disk fails, you don't want to work
> > > > > > > > > > > > > with it again. As such, I've changed my mind and
> > > > > > > > > > > > > believe that we should mark the LogDir (assume
> > > > > > > > > > > > > it's a disk) as offline on the first `IOException`
> > > > > > > > > > > > > encountered. This is the LogCleaner's current
> > > > > > > > > > > > > behavior. We shouldn't change that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Respawning Threads* - I believe we should never
> > > > > > > > > > > > > re-spawn a thread. The correct approach in my mind
> > > > > > > > > > > > > is to either have it stay dead or never let it die
> > > > > > > > > > > > > in the first place.
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Uncleanable-partition-names metric* - Colin is
> > > > > > > > > > > > > right, this metric is unneeded. Users can monitor
> > > > > > > > > > > > > the `uncleanable-partitions-count` metric and
> > > > > > > > > > > > > inspect logs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hey Ray,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 2) I'm 100% with James in agreement with setting
> > > > > > > > > > > > > > up the LogCleaner to skip over problematic
> > > > > > > > > > > > > > partitions instead of dying.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think we can do this for every exception that
> > > > > > > > > > > > > isn't `IOException`. This will future-proof us
> > > > > > > > > > > > > against bugs in the system and potential other
> > > > > > > > > > > > > errors. Protecting yourself against unexpected
> > > > > > > > > > > > > failures is always a good thing in my mind, but I
> > > > > > > > > > > > > also think that protecting yourself against bugs
> > > > > > > > > > > > > in the software is sort of clunky. What does
> > > > > > > > > > > > > everybody think about this?
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 4) The only improvement I can think of is that
> > > > > > > > > > > > > > if such an error occurs, then have the option
> > > > > > > > > > > > > > (configuration setting?) to create a
> > > > > > > > > > > > > > <log_segment>.skip file (or something similar).
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is a good suggestion. Have others also seen
> > > > > > > > > > > > > corruption be generally tied to the same segment?
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io> wrote:
> > > > > > > > > > > > > > For the cleaner thread specifically, I do not
> > > > > > > > > > > > > > think respawning will help at all because we are
> > > > > > > > > > > > > > more than likely to run into the same issue
> > > > > > > > > > > > > > again which would end up crashing the cleaner.
> > > > > > > > > > > > > > Retrying makes sense for transient errors or
> > > > > > > > > > > > > > when you believe some part of the system could
> > > > > > > > > > > > > > have healed itself, both of which I think are
> > > > > > > > > > > > > > not true for the log cleaner.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com> wrote:
> > > > > > > > > > > > > > > <<<respawning threads is likely to make things
> > > > > > > > > > > > > > > worse, by putting you in an infinite loop
> > > > > > > > > > > > > > > which consumes resources and fires off
> > > > > > > > > > > > > > > continuous log messages.
> > > > > > > > > > > > > > > Hi Colin.  In case it could be relevant, one
> > > > > > > > > > > > > > > way to mitigate this effect is to implement a
> > > > > > > > > > > > > > > backoff mechanism (if a second respawn is to
> > > > > > > > > > > > > > > occur then wait for 1 minute before doing it;
> > > > > > > > > > > > > > > then if a third respawn is to occur wait for
> > > > > > > > > > > > > > > 2 minutes before doing it; then 4 minutes,
> > > > > > > > > > > > > > > 8 minutes, etc. up to some max wait time).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I have no opinion on whether respawn is
> > > > > > > > > > > > > > > appropriate or not in this context, but a
> > > > > > > > > > > > > > > mitigation like the increasing backoff
> > > > > > > > > > > > > > > described above may be relevant in weighing
> > > > > > > > > > > > > > > the pros and cons.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Ron
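Ron's increasing-backoff idea can be sketched in isolation. All names here are hypothetical; this is not part of the KIP, just the 1, 2, 4, 8... minute schedule he describes, capped at a maximum wait.

```java
// Sketch of a capped exponential backoff for thread respawns:
// the first respawn waits baseWaitMs, and each subsequent one doubles
// the wait, up to maxWaitMs. Names are hypothetical.
public class RespawnBackoff {
    private final long baseWaitMs;
    private final long maxWaitMs;
    private int respawnCount = 0;

    public RespawnBackoff(long baseWaitMs, long maxWaitMs) {
        this.baseWaitMs = baseWaitMs;
        this.maxWaitMs = maxWaitMs;
    }

    /** Returns how long to wait before the next respawn attempt. */
    public long nextWaitMs() {
        // Cap the shift to avoid overflowing a long for large counts.
        long wait = baseWaitMs << Math.min(respawnCount, 20);
        respawnCount++;
        return Math.min(wait, maxWaitMs);
    }
}
```

With `baseWaitMs = 60_000` and `maxWaitMs = 480_000`, successive calls return 1, 2, 4, and 8 minutes, then stay at 8 minutes. This bounds the resource churn and log spam Colin warns about, though it does not address his deeper objection that respawning after a deterministic bug just hits the same bug again.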
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> if the disk appeared to work for a while, but then
> failed.
> > > Any
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> data
> > > >>>>>>
> > > >>>>>>> that
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> had been written on that DataNode to that disk would be
> > lost,
> > > >>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> we
> > > >>>>>>
> > > >>>>>>> would
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> need to re-replicate it.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Disks aren't biological systems-- they don't heal over
> > time.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> Once
> > > >>>>
> > > >>>>> they're
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> bad, they stay bad.  The log cleaner needs to be robust
> > > against
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> cases
> > > >>>>>>
> > > >>>>>>> where
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> the disk really is failing, and really is returning bad
> > data
> > > or
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> timing
> > > >>>>>>>>>
> > > >>>>>>>>>> out.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> * For your metrics, can you spell out the full metric in
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> JMX-style
> > > >>>>
> > > >>>>> format, such as:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-
> > > >>>> partitions-count
> > > >>>>
> > > >>>>>                 value=4
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names
> can
> > be
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> very
> > > >>>>
> > > >>>>> long.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> I think the current max size is 210 characters (or maybe
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> 240-ish?).
> > > >>>>>>
> > > >>>>>>> Having the "uncleanable-partitions" being a list could be very
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> large
> > > >>>>>>
> > > >>>>>>> metric. Also, having the metric come out as a csv might be
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> difficult
> > > >>>>>>
> > > >>>>>>> to
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> work with for monitoring systems. If we *did* want the
> topic
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> names
> > > >>>>
> > > >>>>> to
> > > >>>>>>>>>
> > > >>>>>>>>>> be
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> accessible, what do you think of having the
> > > >>>>>>>>>>>>>>         kafka.log:type=LogCleanerManag
> > > >>>>>>>>>>>>>> er,topic=topic1,partition=2
> > > >>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but
> > my
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> example
> > > >>>>>>
> > > >>>>>>> was
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> that the topic and partition can be tags in the metric.
> That
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> will
> > > >>>>
> > > >>>>> allow
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> monitoring systems to more easily slice and dice the
> metric.
> > > I'm
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> not
> > > >>>>>>
> > > >>>>>>> sure what the attribute for that metric would be. Maybe
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> something
> > > >>>>
> > > >>>>> like
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> time-since-last-clean?
> > > >>>>>>>>>
> > > >>>>>>>>>> Or
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> maybe even just "Value=1".
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I haven't though about this that hard, but do we really
> > need
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>> uncleanable topic names to be accessible through a
> metric?
> > > It
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> seems
> > > >>>>>>
> > > >>>>>>> like
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> the admin should notice that uncleanable partitions are
> > > present,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> and
> > > >>>>>>
> > > >>>>>>> then
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> check the logs?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this
> > > likely
> > > >>>>>>>>>>>>>> indicates that the disk is having problems. I'm not sure
> > > that
> > > >>>>>>>>>>>>>> is
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> the
> > > >>>>>>
> > > >>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> problems,
> > > >>>>>>
> > > >>>>>>> all
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> of them are partition-level scenarios that happened
> during
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> normal
> > > >>>>
> > > >>>>> operation. None of them were indicative of disk problems.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I don't think this is a meaningful comparison.  In
> general,
> > > we
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> don't
> > > >>>>>>
> > > >>>>>>> accept JIRAs for hard disk problems that happen on a particular
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> cluster.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> If someone opened a JIRA that said "my hard disk is having
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> problems"
> > > >>>>>>
> > > >>>>>>> we
> > > >>>>>>>>>
> > > >>>>>>>>>> could close that as "not a Kafka bug."  This doesn't prove
> > that
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> disk
> > > >>>>>>
> > > >>>>>>> problems don't happen, but  just that JIRA isn't the right
> place
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> for
> > > >>>>>>
> > > >>>>>>> them.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> I do agree that the log cleaner has had a significant
> > number
> > > of
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> logic
> > > >>>>>>
> > > >>>>>>> bugs, and that we need to be careful to limit their impact.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> That's
> > > >>>>
> > > >>>>> one
> > > >>>>>>>>>
> > > >>>>>>>>>> reason why I think that a threshold of "number of
> uncleanable
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> logs"
> > > >>>>
> > > >>>>> is
> > > >>>>>>>>>
> > > >>>>>>>>>> a
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> good idea, rather than just failing after one IOException.
> > In
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> all
> > > >>>>
> > > >>>>> the
> > > >>>>>>>>>
> > > >>>>>>>>>> cases I've seen where a user hit a logic bug in the log
> > cleaner,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> it
> > > >>>>
> > > >>>>> was
> > > >>>>>>>>>
> > > >>>>>>>>>> just one partition that had the issue.  We also should
> > increase
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> test
> > > >>>>>>
> > > >>>>>>> coverage for the log cleaner.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> * About marking disks as offline when exceeding a certain
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> threshold,
> > > >>>>>>
> > > >>>>>>> that actually increases the blast radius of log compaction
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> failures.
> > > >>>>>>
> > > >>>>>>> Currently, the uncleaned partitions are still readable and
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> writable.
> > > >>>>>>
> > > >>>>>>> Taking the disks offline would impact availability of the
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> uncleanable
> > > >>>>>>>>>
> > > >>>>>>>>>> partitions, as well as impact all other partitions that are
> on
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> the
> > > >>>>
> > > >>>>> disk.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> In general, when we encounter I/O errors, we take the
> disk
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>> partition
> > > >>>>>>
> > > >>>>>>> offline.  This is spelled out in KIP-112 (
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%
> > > >>>> 3A+Handle+disk+failure+for+JBOD
> > > >>>>
> > > >>>>> ) :
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> - Broker assumes a log directory to be good after it
> > starts,
> > > >>>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> mark
> > > >>>>>>>>>
> > > >>>>>>>>>> log directory as
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> bad once there is IOException when broker attempts to
> > access
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> (i.e.
> > > >>>>
> > > >>>>> read
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> or write) the log directory.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
> > > >>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log
> > > directory.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>> New
> > > >>>
> > > >>>
> > >
> >
> >
> > --
> > Best,
> > Stanislav
> >
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Stanislav,

Sorry, I was probably looking at an older version (I had the tab open for
so long!).

I have been thinking about `max.uncleanable.partitions` and wondering if
it's what we really want. The main risk if the cleaner cannot clean a
partition is eventually running out of disk space. This is the most common
problem we have seen with cleaner failures and it can happen even if there
is just one uncleanable partition. We've actually seen cases in which a
single __consumer_offsets partition grew large enough to fill a significant portion
of the disk. The difficulty with allowing a system to run out of disk space
before failing is that it makes recovery difficult and time consuming.
Clean shutdown, for example, requires writing some state to disk. Without
clean shutdown, it can take the broker significantly longer to startup
because it has to do more segment recovery.

For this problem, `max.uncleanable.partitions` does not really help. You
can set it to 1 and fail fast, but that is not much better than the
existing state. You had a suggestion previously in the KIP to use the size
of uncleanable disk space instead. What was the reason for rejecting that?
Intuitively, it seems like a better fit for a cleaner failure. It would
provide users some time to react to failures while still protecting them
from exhausting the disk.
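A size-based threshold could be tracked along these lines. This is only a sketch of the idea; the class and names below are hypothetical, not anything in the Kafka codebase or the KIP:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: instead of counting uncleanable partitions, track the
// total bytes held by partitions the cleaner failed on, and only fail the log
// directory once that total crosses a configured byte threshold.
class UncleanableTracker {
    private final long maxUncleanableBytes;
    private final Map<String, Long> uncleanableBytesByPartition = new HashMap<>();

    UncleanableTracker(long maxUncleanableBytes) {
        this.maxUncleanableBytes = maxUncleanableBytes;
    }

    // Record a partition the cleaner failed on, along with its current size.
    void markUncleanable(String topicPartition, long sizeBytes) {
        uncleanableBytesByPartition.put(topicPartition, sizeBytes);
    }

    long totalUncleanableBytes() {
        return uncleanableBytesByPartition.values().stream()
                .mapToLong(Long::longValue)
                .sum();
    }

    // True once uncleanable data is large enough that the disk is at risk.
    boolean shouldFailLogDir() {
        return totalUncleanableBytes() >= maxUncleanableBytes;
    }
}
```

The point of the sketch is that a single huge uncleanable partition (like the __consumer_offsets case above) trips the threshold, while many tiny ones do not.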

Thanks,
Jason




On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <st...@confluent.io>
wrote:

> Hey Jason,
>
> 1. *10* is the default value, it says so in the KIP
> 2. This is a good catch. As the current implementation stands, it's not a
> useful metric since the thread continues to run even if all log directories
> are offline (although I'm not sure what the broker's behavior is in that
> scenario). I'll make sure the thread stops if all log directories are
> offline.
>
> I don't know which "Needs Discussion" item you're referencing; there hasn't
> been any in the KIP since August 1, and that was for the metric only. KIP
> History
> <https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=89064875>
>
> I've updated the KIP to mention the "time-since-last-run" metric.
>
> Thanks,
> Stanislav
>
> On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Stanislav,
> >
> > Just a couple quick questions:
> >
> > 1. I may have missed it, but what will be the default value for
> > `max.uncleanable.partitions`?
> > 2. It seems there will be some impact for users monitoring
> > "time-since-last-run-ms" in order to detect cleaner failures. Not sure it's
> > a major concern, but probably worth mentioning in the compatibility
> > section. Also, is this still a useful metric after this KIP?
> >
> > Also, maybe the "Needs Discussion" item can be moved to rejected
> > alternatives since you've moved to a vote? I think leaving this for
> > potential future work is reasonable.
> >
> > Thanks,
> > Jason
> >
> >
> > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rc...@apache.org> wrote:
> >
> > > I'm okay with that.
> > >
> > > -Ray
> > >
> > > On 8/6/18 10:59 AM, Colin McCabe wrote:
> > >
> > >> Perhaps we could start with max.uncleanable.partitions and then
> > >> implement max.uncleanable.partitions.per.logdir in a follow-up change
> > >> if it seemed to be necessary?  What do you think?
> > >>
> > >> regards,
> > >> Colin
> > >>
> > >>
> > >> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> > >>
> > >>> Hey Ray,
> > >>>
> > >>> Thanks for the explanation. In regards to the configuration property -
> > >>> I'm not sure. As long as it has sufficient documentation, I find
> > >>> "max.uncleanable.partitions" to be okay. If we were to add the
> > >>> distinction explicitly, maybe it should be
> > >>> `max.uncleanable.partitions.per.logdir`?
> > >>>
> > >>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org> wrote:
> > >>>
> > >>> One more thing occurred to me.  Should the configuration property be
> > >>>> named "max.uncleanable.partitions.per.disk" instead?
> > >>>>
> > >>>> -Ray
> > >>>>
> > >>>>
> > >>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > >>>>> Yes, good catch. Thank you, James!
> > >>>>>
> > >>>>> Best,
> > >>>>> Stanislav
> > >>>>>
> > >>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushujames@gmail.com> wrote:
> > >>>>>
> > >>>>>> Can you update the KIP to say what the default is for
> > >>>>>> max.uncleanable.partitions?
> > >>>>>>
> > >>>>>> -James
> > >>>>>>
> > >>>>>> Sent from my iPhone
> > >>>>>>
> > >>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:
> > >>>>>>
> > >>>>>>> Hey group,
> > >>>>>>>
> > >>>>>>> I am planning on starting a voting thread tomorrow. Please do
> > >>>>>>> reply if you feel there is anything left to discuss.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Stanislav
> > >>>>>>>
> > >>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanislav@confluent.io> wrote:
> > >>>>>>>
> > >>>>>>>> Hey, Ray
> > >>>>>>>>
> > >>>>>>>> Thanks for pointing that out, it's fixed now
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Stanislav
> > >>>>>>>>
> > >>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion"
> > >>>>>>>>> table on the main KIP landing page
> > >>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> > >>>>>>>>> I tried, but the Wiki won't let me.
> > >>>>>>>>>
> > >>>>>>>>> -Ray
> > >>>>>>>>>
> > >>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > >>>>>>>>>> Hey guys,
> > >>>>>>>>>>
> > >>>>>>>>>> @Colin - good point. I added some sentences mentioning recent
> > >>>>>>>>>> improvements in the introductory section.
> > >>>>>>>>>>
> > >>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a
> > >>>>>>>>>> disk fails, you don't want to work with it again. As such, I've
> > >>>>>>>>>> changed my mind and believe that we should mark the LogDir
> > >>>>>>>>>> (assume it's a disk) as offline on the first `IOException`
> > >>>>>>>>>> encountered. This is the LogCleaner's current behavior. We
> > >>>>>>>>>> shouldn't change that.
> > >>>>>>>>>>
> > >>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a
> > >>>>>>>>>> thread. The correct approach in my mind is to either have it
> > >>>>>>>>>> stay dead or never let it die in the first place.
> > >>>>>>>>>>
> > >>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this
> > >>>>>>>>>> metric is unneeded. Users can monitor the
> > >>>>>>>>>> `uncleanable-partitions-count` metric and inspect logs.
> > >>>>>>>>>>
> > >>>>>>>>>> Hey Ray,
> > >>>>>>>>>>
> > >>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the
> > >>>>>>>>>>> LogCleaner to skip over problematic partitions instead of
> > >>>>>>>>>>> dying.
> > >>>>>>>>>>
> > >>>>>>>>>> I think we can do this for every exception that isn't
> > >>>>>>>>>> `IOException`. This will future-proof us against bugs in the
> > >>>>>>>>>> system and potential other errors. Protecting yourself against
> > >>>>>>>>>> unexpected failures is always a good thing in my mind, but I
> > >>>>>>>>>> also think that protecting yourself against bugs in the
> > >>>>>>>>>> software is sort of clunky. What does everybody think about
> > >>>>>>>>>> this?
> > >>>>>>>>>>
> > >>>>>>>>>>> 4) The only improvement I can think of is that if such an
> > >>>>>>>>>>> error occurs, then have the option (configuration setting?)
> > >>>>>>>>>>> to create a <log_segment>.skip file (or something similar).
> > >>>>>>>>>>
> > >>>>>>>>>> This is a good suggestion. Have others also seen corruption be
> > >>>>>>>>>> generally tied to the same segment?
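The "skip instead of dying" behavior discussed above might look roughly like the following: catch any exception that isn't an `IOException`, quarantine just that partition, and keep cleaning the rest. All names here are illustrative, not the actual LogCleaner code:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// A cleaning action for one partition; may fail with an IOException.
interface CleanTask {
    void clean(String topicPartition) throws IOException;
}

// Hypothetical sketch of a cleaning pass that survives per-partition logic
// bugs but still fails fast on disk-level (IO) errors, matching the
// "mark the log dir offline on the first IOException" position above.
class SkippingCleaner {
    private final Set<String> uncleanable = new LinkedHashSet<>();

    List<String> cleanAll(List<String> partitions, CleanTask task) throws IOException {
        List<String> cleaned = new ArrayList<>();
        for (String tp : partitions) {
            if (uncleanable.contains(tp)) {
                continue; // known-bad partition: skip it on later passes
            }
            try {
                task.clean(tp);
                cleaned.add(tp);
            } catch (IOException e) {
                throw e; // likely a disk problem: propagate, take the log dir offline
            } catch (RuntimeException e) {
                uncleanable.add(tp); // likely a logic bug: quarantine only this partition
            }
        }
        return cleaned;
    }

    Set<String> uncleanablePartitions() {
        return uncleanable;
    }
}
```

One corrupt partition then no longer stops compaction for every other partition handled by the same thread.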
> > >>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> For the cleaner thread specifically, I do not think respawning
> > >>>>>>>>>>> will help at all because we are more than likely to run into
> > >>>>>>>>>>> the same issue again, which would end up crashing the cleaner.
> > >>>>>>>>>>> Retrying makes sense for transient errors or when you believe
> > >>>>>>>>>>> some part of the system could have healed itself, both of
> > >>>>>>>>>>> which I think are not true for the log cleaner.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> <<<respawning threads is likely to make things worse, by
> > >>>>>>>>>>>> putting you in an infinite loop which consumes resources and
> > >>>>>>>>>>>> fires off continuous log messages.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate
> > >>>>>>>>>>>> this effect is to implement a backoff mechanism (if a second
> > >>>>>>>>>>>> respawn is to occur then wait for 1 minute before doing it;
> > >>>>>>>>>>>> then if a third respawn is to occur wait for 2 minutes before
> > >>>>>>>>>>>> doing it; then 4 minutes, 8 minutes, etc. up to some max wait
> > >>>>>>>>>>>> time).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not in
> > >>>>>>>>>>>> this context, but a mitigation like the increasing backoff
> > >>>>>>>>>>>> described above may be relevant in weighing the pros and cons.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Ron
> > >>>>>>>>>>>>
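The backoff schedule Ron describes (1, 2, 4, 8 minutes up to a cap) might be sketched as follows; the constants are purely illustrative and do not correspond to any actual Kafka configuration:

```java
// Sketch of an exponentially increasing delay between cleaner-thread
// respawns: 1 minute before the second respawn, 2 before the third,
// doubling each time, capped at a maximum wait.
class CleanerBackoff {
    static final long BASE_MS = 60_000L;      // 1 minute
    static final long MAX_MS = 30 * 60_000L;  // cap the wait at 30 minutes

    // attempt 0 = delay before the second respawn, 1 = before the third, ...
    static long backoffMs(int attempt) {
        // clamp the shift so the multiplication cannot overflow
        long delay = BASE_MS * (1L << Math.min(attempt, 20)); // 1, 2, 4, 8 ... minutes
        return Math.min(delay, MAX_MS);
    }
}
```

Such a cap bounds the resource cost of the infinite-respawn loop Colin warns about, without ruling respawning in or out.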
> > >>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > >>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more
> > >>>>>>>>>>>>>> tolerant of errors. Currently, as you said, once it dies,
> > >>>>>>>>>>>>>> it stays dead.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Things are better now than they used to be. We have the
> > >>>>>>>>>>>>>> metric
> > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > >>>>>>>>>>>>>> which we can use to tell us if the threads are dead. And as
> > >>>>>>>>>>>>>> of 1.1.0, we have KIP-226, which allows you to restart the
> > >>>>>>>>>>>>>> log cleaner thread, without requiring a broker restart.
> > >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > >>>>>>>>>>>>>> I've only read about this, I haven't personally tried it.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
> > >>>>>>>>>>>>> probably add a sentence or two mentioning the KIP-226 changes
> > >>>>>>>>>>>>> somewhere in the KIP.  Maybe in the intro section?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I think it's clear that requiring the users to manually
> > >>>>>>>>>>>>> restart the log cleaner is not a very good solution.  But
> > >>>>>>>>>>>>> it's good to know that it's a possibility on some older
> > >>>>>>>>>>>>> releases.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Some comments:
> > >>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to
> > >>>>>>>>>>>>>> clean as many partitions as it can, skipping over the
> > >>>>>>>>>>>>>> problematic ones if possible.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * If the log cleaner thread dies, I think it should
> > >>>>>>>>>>>>>> automatically be revived. Your KIP attempts to do that by
> > >>>>>>>>>>>>>> catching exceptions during execution, but I think we should
> > >>>>>>>>>>>>>> go all the way and make sure that a new one gets created,
> > >>>>>>>>>>>>>> if the thread ever dies.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.
> > >>>>>>>>>>>>> We don't automatically re-create other threads in the broker
> > >>>>>>>>>>>>> if they terminate.  In general, if there is a serious bug in
> > >>>>>>>>>>>>> the code, respawning threads is likely to make things worse,
> > >>>>>>>>>>>>> by putting you in an infinite loop which consumes resources
> > >>>>>>>>>>>>> and fires off continuous log messages.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable
> > >>>>>>>>>>>>>> partitions.  I've seen cases where an uncleanable partition
> > >>>>>>>>>>>>>> later became cleanable.  I unfortunately don't remember how
> > >>>>>>>>>>>>>> that happened, but I remember being surprised when I
> > >>>>>>>>>>>>>> discovered it.  It might have been something like a follower
> > >>>>>>>>>>>>>> was uncleanable but after a leader election happened, the
> > >>>>>>>>>>>>>> log truncated and it was then cleanable again. I'm not sure.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop
> > >>>>>>>>>>>>> Distributed File System (HDFS) and it was a constant source
> > >>>>>>>>>>>>> of user problems.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> What would happen is disks would just go bad over time.  The
> > >>>>>>>>>>>>> DataNode would notice this and take them offline.  But then,
> > >>>>>>>>>>>>> due to some "optimistic" code, the DataNode would
> > >>>>>>>>>>>>> periodically try to re-add them to the system.  Then one of
> > >>>>>>>>>>>>> two things would happen: the disk would just fail immediately
> > >>>>>>>>>>>>> again, or it would appear to work and then fail after a short
> > >>>>>>>>>>>>> amount of time.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The way the disk failed was normally having an I/O request
> > >>>>>>>>>>>>> take a really long time and time out.  So a bunch of request
> > >>>>>>>>>>>>> handler threads would basically slam into a brick wall when
> > >>>>>>>>>>>>> they tried to access the bad disk, slowing the DataNode to a
> > >>>>>>>>>>>>> crawl.  It was even worse in the second scenario: if the disk
> > >>>>>>>>>>>>> appeared to work for a while, but then failed, any data that
> > >>>>>>>>>>>>> had been written on that DataNode to that disk would be lost,
> > >>>>>>>>>>>>> and we would need to re-replicate it.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Disks aren't biological systems -- they don't heal over time.
> > >>>>>>>>>>>>> Once they're bad, they stay bad.  The log cleaner needs to be
> > >>>>>>>>>>>>> robust against cases where the disk really is failing, and
> > >>>>>>>>>>>>> really is returning bad data or timing out.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * For your metrics, can you spell out the full metric in
> > >>>>>>>>>>>>>> JMX-style format, such as:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> > >>>>>>>>>>>>>>     value=4
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be
> > >>>>>>>>>>>>>> very long. I think the current max size is 210 characters
> > >>>>>>>>>>>>>> (or maybe 240-ish?). Having the "uncleanable-partitions" be
> > >>>>>>>>>>>>>> a list could make for a very large metric. Also, having the
> > >>>>>>>>>>>>>> metric come out as a csv might be difficult to work with for
> > >>>>>>>>>>>>>> monitoring systems. If we *did* want the topic names to be
> > >>>>>>>>>>>>>> accessible, what do you think of having the
> > >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > >>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> > >>>>>>>>>>>>>> example was that the topic and partition can be tags in the
> > >>>>>>>>>>>>>> metric. That will allow monitoring systems to more easily
> > >>>>>>>>>>>>>> slice and dice the metric. I'm not sure what the attribute
> > >>>>>>>>>>>>>> for that metric would be. Maybe something like "uncleaned
> > >>>>>>>>>>>>>> bytes" for that topic-partition? Or time-since-last-clean?
> > >>>>>>>>>>>>>> Or maybe even just "Value=1".
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I haven't thought about this that hard, but do we really need
> > >>>>>>>>>>>>> the uncleanable topic names to be accessible through a
> > >>>>>>>>>>>>> metric?  It seems like the admin should notice that
> > >>>>>>>>>>>>> uncleanable partitions are present, and then check the logs?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this
> > >>>>>>>>>>>>>> likely indicates that the disk is having problems. I'm not
> > >>>>>>>>>>>>>> sure that is the case. For the 4 JIRAs that you mentioned
> > >>>>>>>>>>>>>> about log cleaner problems, all of them are partition-level
> > >>>>>>>>>>>>>> scenarios that happened during normal operation. None of
> > >>>>>>>>>>>>>> them were indicative of disk problems.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I don't think this is a meaningful comparison.  In general,
> > >>>>>>>>>>>>> we don't accept JIRAs for hard disk problems that happen on a
> > >>>>>>>>>>>>> particular cluster.  If someone opened a JIRA that said "my
> > >>>>>>>>>>>>> hard disk is having problems" we could close that as "not a
> > >>>>>>>>>>>>> Kafka bug."  This doesn't prove that disk problems don't
> > >>>>>>>>>>>>> happen, just that JIRA isn't the right place for them.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I do agree that the log cleaner has had a significant number
> > >>>>>>>>>>>>> of logic bugs, and that we need to be careful to limit their
> > >>>>>>>>>>>>> impact.  That's one reason why I think that a threshold on
> > >>>>>>>>>>>>> the number of uncleanable logs is a good idea, rather than
> > >>>>>>>>>>>>> just failing after one IOException.  In all the cases I've
> > >>>>>>>>>>>>> seen where a user hit a logic bug in the log cleaner, it was
> > >>>>>>>>>>>>> just one partition that had the issue.  We also should
> > >>>>>>>>>>>>> increase test coverage for the log cleaner.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> * About marking disks as offline when exceeding a certain
> > >>>>>>>>>>>>>> threshold, that actually increases the blast radius of log
> > >>>>>>>>>>>>>> compaction failures. Currently, the uncleaned partitions are
> > >>>>>>>>>>>>>> still readable and
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> writable.
> > >>>>>>
> > >>>>>>> Taking the disks offline would impact availability of the
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> uncleanable
> > >>>>>>>>>
> > >>>>>>>>>> partitions, as well as impact all other partitions that are on
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> the
> > >>>>
> > >>>>> disk.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> In general, when we encounter I/O errors, we take the disk
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>> partition
> > >>>>>>
> > >>>>>>> offline.  This is spelled out in KIP-112 (
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%
> > >>>> 3A+Handle+disk+failure+for+JBOD
> > >>>>
> > >>>>> ) :
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Broker assumes a log directory to be good after it
> starts,
> > >>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> mark
> > >>>>>>>>>
> > >>>>>>>>>> log directory as
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> bad once there is IOException when broker attempts to
> access
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> (i.e.
> > >>>>
> > >>>>> read
> > >>>>>>>>>>>
> > >>>>>>>>>>>> or write) the log directory.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
> > >>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log
> > directory.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> New
> > >>>
> > >>>
> >
>
>
> --
> Best,
> Stanislav
>

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey Jason,

1. *10* is the default value, it says so in the KIP
2. This is a good catch. As the current implementation stands, it's not a
useful metric since the thread continues to run even if all log directories
are offline (although I'm not sure what the broker's behavior is in that
scenario). I'll make sure the thread stops if all log directories are
offline.

I don't know which "Needs Discussion" item you're referencing, there hasn't
been any in the KIP since August 1 and that was for the metric only. KIP
History
<https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=89064875>

I've updated the KIP to mention the "time-since-last-run" metric.
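As an aside, for anyone wiring these up: an alerting rule over the two metrics could look roughly like the sketch below. The metric names are the ones from this thread; the snapshot format and the staleness threshold are illustrative assumptions, not part of the KIP.

```python
# Rough alert check over the LogCleaner metrics discussed in this thread.
# The metric names come from the KIP discussion; the snapshot dict and the
# threshold value are illustrative assumptions.

def cleaner_alerts(snapshot, max_idle_ms=15 * 60 * 1000):
    """Return alert messages for one broker's metric snapshot."""
    alerts = []
    uncleanable = snapshot.get("uncleanable-partitions-count", 0)
    if uncleanable > 0:
        alerts.append("uncleanable partitions present: %d" % uncleanable)
    if snapshot.get("time-since-last-run-ms", 0) > max_idle_ms:
        alerts.append("log cleaner has not run recently")
    return alerts

if __name__ == "__main__":
    print(cleaner_alerts({"uncleanable-partitions-count": 4,
                          "time-since-last-run-ms": 60000}))
```

The idea is just that a non-zero `uncleanable-partitions-count` or a stale `time-since-last-run-ms` both warrant a look at the broker logs.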

Thanks,
Stanislav

On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Stanislav,
>
> Just a couple quick questions:
>
> 1. I may have missed it, but what will be the default value for
> `max.uncleanable.partitions`?
> 2. It seems there will be some impact for users that are monitoring
> "time-since-last-run-ms" in order to detect cleaner failures. Not sure it's
> a major concern, but probably worth mentioning in the compatibility
> section. Also, is this still a useful metric after this KIP?
>
> Also, maybe the "Needs Discussion" item can be moved to rejected
> alternatives since you've moved to a vote? I think leaving this for
> potential future work is reasonable.
>
> Thanks,
> Jason
>
> On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rc...@apache.org> wrote:
>
> > I'm okay with that.
> >
> > -Ray
> >
> > On 8/6/18 10:59 AM, Colin McCabe wrote:
> >
> >> Perhaps we could start with max.uncleanable.partitions and then
> >> implement max.uncleanable.partitions.per.logdir in a follow-up change
> >> if it seemed to be necessary?  What do you think?
> >>
> >> regards,
> >> Colin
> >>
> >> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> >>
> >>> Hey Ray,
> >>>
> >>> Thanks for the explanation. In regards to the configuration property -
> >>> I'm not sure. As long as it has sufficient documentation, I find
> >>> "max.uncleanable.partitions" to be okay. If we were to add the
> >>> distinction explicitly, maybe it should be
> >>> `max.uncleanable.partitions.per.logdir` ?
> >>>
> >>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org> wrote:
> >>>
> >>>> One more thing occurred to me.  Should the configuration property be
> >>>> named "max.uncleanable.partitions.per.disk" instead?
> >>>>
> >>>> -Ray
> >>>>
> >>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> >>>>
> >>>>> Yes, good catch. Thank you, James!
> >>>>>
> >>>>> Best,
> >>>>> Stanislav
> >>>>>
> >>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Can you update the KIP to say what the default is for
> >>>>>> max.uncleanable.partitions?
> >>>>>>
> >>>>>> -James
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <
> >>>>>>> stanislav@confluent.io> wrote:
> >>>>>>>
> >>>>>>> Hey group,
> >>>>>>>
> >>>>>>> I am planning on starting a voting thread tomorrow. Please do
> >>>>>>> reply if you feel there is anything left to discuss.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Stanislav
> >>>>>>>
> >>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> >>>>>>> stanislav@confluent.io> wrote:
> >>>>>>>
> >>>>>>>> Hey, Ray
> >>>>>>>>
> >>>>>>>> Thanks for pointing that out, it's fixed now
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Stanislav
> >>>>>>>>
> >>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion"
> >>>>>>>>> table on the main KIP landing page
> >>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>
> >>>>>>>>> ?
> >>>>>>>>>
> >>>>>>>>> I tried, but the Wiki won't let me.
> >>>>>>>>>
> >>>>>>>>> -Ray
> >>>>>>>>>
> >>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>>>>>>>> Hey guys,
> >>>>>>>>>>
> >>>>>>>>>> @Colin - good point. I added some sentences mentioning recent
> >>>>>>>>>> improvements in the introductory section.
> >>>>>>>>>>
> >>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a
> >>>>>>>>>> disk fails, you don't want to work with it again. As such, I've
> >>>>>>>>>> changed my mind and believe that we should mark the LogDir
> >>>>>>>>>> (assume it's a disk) as offline on the first `IOException`
> >>>>>>>>>> encountered. This is the LogCleaner's current behavior. We
> >>>>>>>>>> shouldn't change that.
> >>>>>>>>>>
> >>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a
> >>>>>>>>>> thread. The correct approach in my mind is to either have it
> >>>>>>>>>> stay dead or never let it die in the first place.
> >>>>>>>>>>
> >>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this
> >>>>>>>>>> metric is unneeded. Users can monitor the
> >>>>>>>>>> `uncleanable-partitions-count` metric and inspect logs.
> >>>>>>>>>>
> >>>>>>>>>> Hey Ray,
> >>>>>>>>>>
> >>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the
> >>>>>>>>>>> LogCleaner to skip over problematic partitions instead of
> >>>>>>>>>>> dying.
> >>>>>>>>>>
> >>>>>>>>>> I think we can do this for every exception that isn't
> >>>>>>>>>> `IOException`. This will future-proof us against bugs in the
> >>>>>>>>>> system and potential other errors. Protecting yourself against
> >>>>>>>>>> unexpected failures is always a good thing in my mind, but I
> >>>>>>>>>> also think that protecting yourself against bugs in the
> >>>>>>>>>> software is sort of clunky. What does everybody think about
> >>>>>>>>>> this?
> >>>>>>>>>>
> >>>>>>>>>>> 4) The only improvement I can think of is that if such an
> >>>>>>>>>>> error occurs, then have the option (configuration setting?) to
> >>>>>>>>>>> create a <log_segment>.skip file (or something similar).
> >>>>>>>>>>
> >>>>>>>>>> This is a good suggestion. Have others also seen corruption be
> >>>>>>>>>> generally tied to the same segment?
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <
> >>>>>>>>>> dhruvil@confluent.io> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> For the cleaner thread specifically, I do not think respawning
> >>>>>>>>>>> will help at all because we are more than likely to run into
> >>>>>>>>>>> the same issue again which would end up crashing the cleaner.
> >>>>>>>>>>> Retrying makes sense for transient errors or when you believe
> >>>>>>>>>>> some part of the system could have healed itself, both of
> >>>>>>>>>>> which I think are not true for the log cleaner.
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <
> >>>>>>>>>>> rndgstn@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> <<<respawning threads is likely to make things worse, by
> >>>>>>>>>>>> putting you in an infinite loop which consumes resources and
> >>>>>>>>>>>> fires off continuous log messages.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate
> >>>>>>>>>>>> this effect is to implement a backoff mechanism (if a second
> >>>>>>>>>>>> respawn is to occur then wait for 1 minute before doing it;
> >>>>>>>>>>>> then if a third respawn is to occur wait for 2 minutes before
> >>>>>>>>>>>> doing it; then 4 minutes, 8 minutes, etc. up to some max wait
> >>>>>>>>>>>> time).
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not in
> >>>>>>>>>>>> this context, but a mitigation like the increasing backoff
> >>>>>>>>>>>> described above may be relevant in weighing the pros and
> >>>>>>>>>>>> cons.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ron
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <
> >>>>>>>>>>>> cmccabe@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more
> >>>>>>>>>>>>>> tolerant of errors. Currently, as you said, once it dies,
> >>>>>>>>>>>>>> it stays dead.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Things are better now than they used to be. We have the
> >>>>>>>>>>>>>> metric
> >>>>>>>>>>>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>>>>>>>> which we can use to tell us if the threads are dead. And as
> >>>>>>>>>>>>>> of 1.1.0, we have KIP-226, which allows you to restart the
> >>>>>>>>>>>>>> log cleaner thread, without requiring a broker restart.
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>>>>>>>> I've only read about this, I haven't personally tried it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
> >>>>>>>>>>>>> probably add a sentence or two mentioning the KIP-226
> >>>>>>>>>>>>> changes somewhere in the KIP.  Maybe in the intro section?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think it's clear that requiring the users to manually
> >>>>>>>>>>>>> restart the log cleaner is not a very good solution.  But
> >>>>>>>>>>>>> it's good to know that it's a possibility on some older
> >>>>>>>>>>>>> releases.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Some comments:
> >>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to
> >>>>>>>>>>>>>> clean as many partitions as it can, skipping over the
> >>>>>>>>>>>>>> problematic ones if possible.
> >>>>>>>>>>>>>> * If the log cleaner thread dies, I think it should
> >>>>>>>>>>>>>> automatically be revived. Your KIP attempts to do that by
> >>>>>>>>>>>>>> catching exceptions during execution, but I think we should
> >>>>>>>>>>>>>> go all the way and make sure that a new one gets created,
> >>>>>>>>>>>>>> if the thread ever dies.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.
> >>>>>>>>>>>>> We don't automatically re-create other threads in the broker
> >>>>>>>>>>>>> if they terminate.  In general, if there is a serious bug in
> >>>>>>>>>>>>> the code, respawning threads is likely to make things worse,
> >>>>>>>>>>>>> by putting you in an infinite loop which consumes resources
> >>>>>>>>>>>>> and fires off continuous log messages.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable
> >>>>>>>>>>>>>> partitions. I've seen cases where an uncleanable partition
> >>>>>>>>>>>>>> later became cleanable. I unfortunately don't remember how
> >>>>>>>>>>>>>> that happened, but I remember being surprised when I
> >>>>>>>>>>>>>> discovered it. It might have been something like a follower
> >>>>>>>>>>>>>> was uncleanable but after a leader election happened, the
> >>>>>>>>>>>>>> log truncated and it was then cleanable again. I'm not
> >>>>>>>>>>>>>> sure.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop
> >>>>>>>>>>>>> Distributed File System (HDFS) and it was a constant source
> >>>>>>>>>>>>> of user problems.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What would happen is disks would just go bad over time.  The
> >>>>>>>>>>>>> DataNode would notice this and take them offline.  But then,
> >>>>>>>>>>>>> due to some "optimistic" code, the DataNode would
> >>>>>>>>>>>>> periodically try to re-add them to the system.  Then one of
> >>>>>>>>>>>>> two things would happen: the disk would just fail
> >>>>>>>>>>>>> immediately again, or it would appear to work and then fail
> >>>>>>>>>>>>> after a short amount of time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The way the disk failed was normally having an I/O request
> >>>>>>>>>>>>> take a really long time and time out.  So a bunch of request
> >>>>>>>>>>>>> handler threads would basically slam into a brick wall when
> >>>>>>>>>>>>> they tried to access the bad disk, slowing the DataNode to a
> >>>>>>>>>>>>> crawl.  It was even worse in the second scenario, if the
> >>>>>>>>>>>>> disk appeared to work for a while, but then failed.  Any
> >>>>>>>>>>>>> data that had been written on that DataNode to that disk
> >>>>>>>>>>>>> would be lost, and we would need to re-replicate it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Disks aren't biological systems-- they don't heal over
> >>>>>>>>>>>>> time.  Once they're bad, they stay bad.  The log cleaner
> >>>>>>>>>>>>> needs to be robust against cases where the disk really is
> >>>>>>>>>>>>> failing, and really is returning bad data or timing out.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> * For your metrics, can you spell out the full metric in
> >>>>>>>>>>>>>> JMX-style format, such as:
> >>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >>>>>>>>>>>>>>                 value=4
> >>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can
> >>>>>>>>>>>>>> be very long. I think the current max size is 210
> >>>>>>>>>>>>>> characters (or maybe 240-ish?). Having the
> >>>>>>>>>>>>>> "uncleanable-partitions" being a list could be a very large
> >>>>>>>>>>>>>> metric. Also, having the metric come out as a csv might be
> >>>>>>>>>>>>>> difficult to work with for monitoring systems. If we *did*
> >>>>>>>>>>>>>> want the topic names to be accessible, what do you think of
> >>>>>>>>>>>>>> having the
> >>>>>>>>>>>>>>         kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> >>>>>>>>>>>>>> example was that the topic and partition can be tags in the
> >>>>>>>>>>>>>> metric. That will allow monitoring systems to more easily
> >>>>>>>>>>>>>> slice and dice the metric. I'm not sure what the attribute
> >>>>>>>>>>>>>> for that metric would be. Maybe something like "uncleaned
> >>>>>>>>>>>>>> bytes" for that topic-partition? Or time-since-last-clean?
> >>>>>>>>>>>>>> Or maybe even just "Value=1".
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I haven't thought about this that hard, but do we really
> >>>>>>>>>>>>> need the uncleanable topic names to be accessible through a
> >>>>>>>>>>>>> metric?  It seems like the admin should notice that
> >>>>>>>>>>>>> uncleanable partitions are present, and then check the logs?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this
> >>>>>>>>>>>>>> likely indicates that the disk is having problems. I'm not
> >>>>>>>>>>>>>> sure that is the case. For the 4 JIRAs that you mentioned
> >>>>>>>>>>>>>> about log cleaner problems, all of them are partition-level
> >>>>>>>>>>>>>> scenarios that happened during normal operation. None of
> >>>>>>>>>>>>>> them were indicative of disk problems.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I don't think this is a meaningful comparison.  In general,
> >>>>>>>>>>>>> we don't accept JIRAs for hard disk problems that happen on
> >>>>>>>>>>>>> a particular cluster.  If someone opened a JIRA that said
> >>>>>>>>>>>>> "my hard disk is having problems" we could close that as
> >>>>>>>>>>>>> "not a Kafka bug."  This doesn't prove that disk problems
> >>>>>>>>>>>>> don't happen, but just that JIRA isn't the right place for
> >>>>>>>>>>>>> them.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I do agree that the log cleaner has had a significant number
> >>>>>>>>>>>>> of logic bugs, and that we need to be careful to limit their
> >>>>>>>>>>>>> impact.  That's one reason why I think that a threshold of
> >>>>>>>>>>>>> "number of uncleanable logs" is a good idea, rather than
> >>>>>>>>>>>>> just failing after one IOException.  In all the cases I've
> >>>>>>>>>>>>> seen where a user hit a logic bug in the log cleaner, it was
> >>>>>>>>>>>>> just one partition that had the issue.  We also should
> >>>>>>>>>>>>> increase test coverage for the log cleaner.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> * About marking disks as offline when exceeding a certain
> >>>>>>>>>>>>>> threshold, that actually increases the blast radius of log
> >>>>>>>>>>>>>> compaction failures. Currently, the uncleaned partitions
> >>>>>>>>>>>>>> are still readable and writable. Taking the disks offline
> >>>>>>>>>>>>>> would impact availability of the uncleanable partitions, as
> >>>>>>>>>>>>>> well as impact all other partitions that are on the disk.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In general, when we encounter I/O errors, we take the disk
> >>>>>>>>>>>>> partition offline.  This is spelled out in KIP-112 (
> >>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> >>>>>>>>>>>>> ) :
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> - Broker assumes a log directory to be good after it
> >>>>>>>>>>>>>> starts, and mark log directory as bad once there is
> >>>>>>>>>>>>>> IOException when broker attempts to access (i.e. read or
> >>>>>>>>>>>>>> write) the log directory.
> >>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
> >>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log
> >>>>>>>>>>>>>> directory.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> New
> >>>
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Stanislav,

Just a couple quick questions:

1. I may have missed it, but what will be the default value for
`max.uncleanable.partitions`?
2. It seems there will be some impact for users that are monitoring
"time-since-last-run-ms" in order to detect cleaner failures. Not sure it's
a major concern, but probably worth mentioning in the compatibility
section. Also, is this still a useful metric after this KIP?

Also, maybe the "Needs Discussion" item can be moved to rejected
alternatives since you've moved to a vote? I think leaving this for
potential future work is reasonable.
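To make the thresholding concrete, here is a rough sketch of the bookkeeping being discussed (the class and method names are illustrative assumptions, not the actual broker code; per the KIP, 10 is the default limit):

```python
# Sketch of the `max.uncleanable.partitions` threshold discussed in this
# thread: track uncleanable partitions per log directory and mark the
# directory offline once the count exceeds the limit. Names and structure
# are illustrative assumptions, not the broker's implementation.

class UncleanablePartitions:
    def __init__(self, max_uncleanable_partitions=10):
        self.max_uncleanable = max_uncleanable_partitions
        self.by_logdir = {}          # log dir -> set of topic-partitions
        self.offline_logdirs = set()

    def mark_uncleanable(self, logdir, topic_partition):
        """Record a failed partition; return True if the dir must go offline."""
        parts = self.by_logdir.setdefault(logdir, set())
        parts.add(topic_partition)
        if len(parts) > self.max_uncleanable:
            self.offline_logdirs.add(logdir)
        return logdir in self.offline_logdirs

    def uncleanable_partitions_count(self, logdir):
        """Per-directory counterpart of the `uncleanable-partitions-count` metric."""
        return len(self.by_logdir.get(logdir, set()))
```

With a limit of 10, the eleventh uncleanable partition in a single log directory would tip that directory offline, while other directories keep cleaning.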

Thanks,
Jason


On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rc...@apache.org> wrote:

> I'm okay with that.
>
> -Ray
>
> On 8/6/18 10:59 AM, Colin McCabe wrote:
>
>> Perhaps we could start with max.uncleanable.partitions and then implement
>> max.uncleanable.partitions.per.logdir in a follow-up change if it seemed
>> to be necessary?  What do you think?
>>
>> regards,
>> Colin
>>
>>
>> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
>>
>>> Hey Ray,
>>>
>>> Thanks for the explanation. In regards to the configuration property -
>>> I'm
>>> not sure. As long as it has sufficient documentation, I find
>>> "max.uncleanable.partitions" to be okay. If we were to add the
>>> distinction
>>> explicitly, maybe it should be `max.uncleanable.partitions.per.logdir` ?
>>>
>>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org> wrote:
>>>
>>> One more thing occurred to me.  Should the configuration property be
>>>> named "max.uncleanable.partitions.per.disk" instead?
>>>>
>>>> -Ray
>>>>
>>>>
>>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
>>>>
>>>>> Yes, good catch. Thank you, James!
>>>>>
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Can you update the KIP to say what the default is for
>>>>>> max.uncleanable.partitions?
>>>>>>
>>>>>> -James
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <
>>>>>>> stanislav@confluent.io> wrote:
>>>>>>>
>>>>>>> Hey group,
>>>>>>>
>>>>>>> I am planning on starting a voting thread tomorrow. Please do reply
>>>>>>> if you feel there is anything left to discuss.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stanislav
>>>>>>>
>>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
>>>>>>> stanislav@confluent.io> wrote:
>>>>>>>
>>>>>>>> Hey, Ray
>>>>>>>>
>>>>>>>> Thanks for pointing that out, it's fixed now
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stanislav
>>>>>>>>
>>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table
>>>>>>>>> on the main KIP landing page
>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>
>>>>>>>>> ?
>>>>>>>>>
>>>>>>>>> I tried, but the Wiki won't let me.
>>>>>>>>>
>>>>>>>>> -Ray
>>>>>>>>>
>>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>>>>>>>>>> Hey guys,
>>>>>>>>>>
>>>>>>>>>> @Colin - good point. I added some sentences mentioning recent
>>>>>>>>>> improvements in the introductory section.
>>>>>>>>>>
>>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a
>>>>>>>>>> disk fails, you don't want to work with it again. As such, I've
>>>>>>>>>> changed my mind and believe that we should mark the LogDir
>>>>>>>>>> (assume it's a disk) as offline on the first `IOException`
>>>>>>>>>> encountered. This is the LogCleaner's current behavior. We
>>>>>>>>>> shouldn't change that.
>>>>>>>>>>
>>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a
>>>>>>>>>> thread. The correct approach in my mind is to either have it stay
>>>>>>>>>> dead or never let it die in the first place.
>>>>>>>>>>
>>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this
>>>>>>>>>> metric is unneeded. Users can monitor the
>>>>>>>>>> `uncleanable-partitions-count` metric and inspect logs.
>>>>>>>>>>
>>>>>>>>>> Hey Ray,
>>>>>>>>>>
>>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the
>>>>>>>>>>> LogCleaner to skip over problematic partitions instead of dying.
>>>>>>>>>>
>>>>>>>>>> I think we can do this for every exception that isn't
>>>>>>>>>> `IOException`. This will future-proof us against bugs in the
>>>>>>>>>> system and potential other errors. Protecting yourself against
>>>>>>>>>> unexpected failures is always a good thing in my mind, but I also
>>>>>>>>>> think that protecting yourself against bugs in the software is
>>>>>>>>>> sort of clunky. What does everybody think about this?
>>>>>>>>>>
>>>>>>>>>>> 4) The only improvement I can think of is that if such an error
>>>>>>>>>>> occurs, then have the option (configuration setting?) to create
>>>>>>>>>>> a <log_segment>.skip file (or something similar).
>>>>>>>>>>
>>>>>>>>>> This is a good suggestion. Have others also seen corruption be
>>>>>>>>>> generally tied to the same segment?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <
>>>>>>>>>> dhruvil@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> For the cleaner thread specifically, I do not think respawning
>>>>>>>>>>> will help at all because we are more than likely to run into the
>>>>>>>>>>> same issue again which would end up crashing the cleaner.
>>>>>>>>>>> Retrying makes sense for transient errors or when you believe
>>>>>>>>>>> some part of the system could have healed itself, both of which
>>>>>>>>>>> I think are not true for the log cleaner.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <
>>>>>>>>>>> rndgstn@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> <<<respawning threads is likely to make things worse, by
>>>>>>>>>>>> putting you in an infinite loop which consumes resources and
>>>>>>>>>>>> fires off continuous log messages.
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate
>>>>>>>>>>>> this effect is to implement a backoff mechanism (if a second
>>>>>>>>>>>> respawn is to occur then wait for 1 minute before doing it;
>>>>>>>>>>>> then if a third respawn is to occur wait for 2 minutes before
>>>>>>>>>>>> doing it; then 4 minutes, 8 minutes, etc. up to some max wait
>>>>>>>>>>>> time).
>>>>>>>>>>>>
>>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not in
>>>>>>>>>>>> this context, but a mitigation like the increasing backoff
>>>>>>>>>>>> described above may be relevant in weighing the pros and cons.
>>>>>>>>>>>>
>>>>>>>>>>>> Ron
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <
>>>>>>>>>>>> cmccabe@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more
>>>>>>>>>>>>>>
>>>>>>>>>>>>> tolerant
>>>>
>>>>> of
>>>>>>>>>
>>>>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Things are better now than they used to be. We have the metric
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>>
>>>>> which we can use to tell us if the threads are dead. And as of
>>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.1.0,
>>>>>>>>>
>>>>>>>>>> we
>>>>>>>>>>>>
>>>>>>>>>>>>> have KIP-226, which allows you to restart the log cleaner
>>>>>>>>>>>>>>
>>>>>>>>>>>>> thread,
>>>>
>>>>> without requiring a broker restart.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+
>>>> Dynamic+Broker+Configuration
>>>>
>>>>> <
>>>>>>>>>>>>>>
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+
>>>> Dynamic+Broker+Configuration
>>>>
>>>>> I've only read about this, I haven't personally tried it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
>>>>>>>>>>>>>
>>>>>>>>>>>> probably
>>>>
>>>>> add a
>>>>>>>>>>>
>>>>>>>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the
>>>>>>>>>>>>>
>>>>>>>>>>>> KIP.
>>>>>>
>>>>>>> Maybe
>>>>>>>>>>>>
>>>>>>>>>>>>> in the intro section?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's clear that requiring the users to manually restart
>>>>>>>>>>>>>
>>>>>>>>>>>> the
>>>>
>>>>> log
>>>>>>>>>
>>>>>>>>>> cleaner is not a very good solution.  But it's good to know that
>>>>>>>>>>>>>
>>>>>>>>>>>> it's a
>>>>>>>>>
>>>>>>>>>> possibility on some older releases.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Some comments:
>>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to clean
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>
>>>>>>>>>>>>> many
>>>>>>>>>
>>>>>>>>>> partitions as it can, skipping over the problematic ones if
>>>>>>>>>>>>>>
>>>>>>>>>>>>> possible.
>>>>>>>>>
>>>>>>>>>> * If the log cleaner thread dies, I think it should
>>>>>>>>>>>>>>
>>>>>>>>>>>>> automatically
>>>>
>>>>> be
>>>>>>
>>>>>>> revived. Your KIP attempts to do that by catching exceptions
>>>>>>>>>>>>>>
>>>>>>>>>>>>> during
>>>>>>
>>>>>>> execution, but I think we should go all the way and make sure
>>>>>>>>>>>>>>
>>>>>>>>>>>>> that a
>>>>>>
>>>>>>> new
>>>>>>>>>>>>
>>>>>>>>>>>>> one gets created, if the thread ever dies.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We
>>>>>>>>>>>>>
>>>>>>>>>>>> don't
>>>>>>
>>>>>>> automatically re-create other threads in the broker if they
>>>>>>>>>>>>>
>>>>>>>>>>>> terminate.
>>>>>>>>>
>>>>>>>>>> In
>>>>>>>>>>>>
>>>>>>>>>>>>> general, if there is a serious bug in the code, respawning
>>>>>>>>>>>>>
>>>>>>>>>>>> threads
>>>>
>>>>> is
>>>>>>
>>>>>>> likely to make things worse, by putting you in an infinite loop
>>>>>>>>>>>>>
>>>>>>>>>>>> which
>>>>>>
>>>>>>> consumes resources and fires off continuous log messages.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable
>>>>>>>>>>>>>>
>>>>>>>>>>>>> partitions.
>>>>
>>>>> I've
>>>>>>>>>>>
>>>>>>>>>>>> seen cases where an uncleanable partition later became
>>>>>>>>>>>>>>
>>>>>>>>>>>>> cleanable.
>>>>
>>>>> I
>>>>>>
>>>>>>> unfortunately don't remember how that happened, but I remember
>>>>>>>>>>>>>>
>>>>>>>>>>>>> being
>>>>>>
>>>>>>> surprised when I discovered it. It might have been something
>>>>>>>>>>>>>>
>>>>>>>>>>>>> like
>>>>
>>>>> a
>>>>>>
>>>>>>> follower was uncleanable but after a leader election happened,
>>>>>>>>>>>>>>
>>>>>>>>>>>>> the
>>>>
>>>>> log
>>>>>>>>>>>
>>>>>>>>>>>> truncated and it was then cleanable again. I'm not sure.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop
>>>>>>>>>>>>>
>>>>>>>>>>>> Distributed
>>>>
>>>>> File
>>>>>>>>>
>>>>>>>>>> System (HDFS) and it was a constant source of user problems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would happen is disks would just go bad over time.  The
>>>>>>>>>>>>>
>>>>>>>>>>>> DataNode
>>>>>>
>>>>>>> would notice this and take them offline.  But then, due to some
>>>>>>>>>>>>> "optimistic" code, the DataNode would periodically try to
>>>>>>>>>>>>> re-add
>>>>>>>>>>>>>
>>>>>>>>>>>> them
>>>>>>
>>>>>>> to
>>>>>>>>>>>
>>>>>>>>>>>> the system.  Then one of two things would happen: the disk would
>>>>>>>>>>>>>
>>>>>>>>>>>> just
>>>>>>
>>>>>>> fail
>>>>>>>>>>>>
>>>>>>>>>>>>> immediately again, or it would appear to work and then fail
>>>>>>>>>>>>>
>>>>>>>>>>>> after a
>>>>
>>>>> short
>>>>>>>>>>>
>>>>>>>>>>>> amount of time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The way the disk failed was normally having an I/O request
>>>>>>>>>>>>> take a
>>>>>>>>>>>>>
>>>>>>>>>>>> really
>>>>>>>>>>>
>>>>>>>>>>>> long time and time out.  So a bunch of request handler threads
>>>>>>>>>>>>>
>>>>>>>>>>>> would
>>>>>>
>>>>>>> basically slam into a brick wall when they tried to access the
>>>>>>>>>>>>>
>>>>>>>>>>>> bad
>>>>
>>>>> disk,
>>>>>>>>>>>
>>>>>>>>>>>> slowing the DataNode to a crawl.  It was even worse in the
>>>>>>>>>>>>> second
>>>>>>>>>>>>>
>>>>>>>>>>>> scenario,
>>>>>>>>>>>>
>>>>>>>>>>>>> if the disk appeared to work for a while, but then failed.  Any
>>>>>>>>>>>>>
>>>>>>>>>>>> data
>>>>>>
>>>>>>> that
>>>>>>>>>>>
>>>>>>>>>>>> had been written on that DataNode to that disk would be lost,
>>>>>>>>>>>>> and
>>>>>>>>>>>>>
>>>>>>>>>>>> we
>>>>>>
>>>>>>> would
>>>>>>>>>>>>
>>>>>>>>>>>>> need to re-replicate it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Disks aren't biological systems-- they don't heal over time.
>>>>>>>>>>>>>
>>>>>>>>>>>> Once
>>>>
>>>>> they're
>>>>>>>>>>>>
>>>>>>>>>>>>> bad, they stay bad.  The log cleaner needs to be robust against
>>>>>>>>>>>>>
>>>>>>>>>>>> cases
>>>>>>
>>>>>>> where
>>>>>>>>>>>>
>>>>>>>>>>>>> the disk really is failing, and really is returning bad data or
>>>>>>>>>>>>>
>>>>>>>>>>>> timing
>>>>>>>>>
>>>>>>>>>> out.
>>>>>>>>>>>>
>>>>>>>>>>>>> * For your metrics, can you spell out the full metric in
>>>>>>>>>>>>>>
>>>>>>>>>>>>> JMX-style
>>>>
>>>>> format, such as:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-
>>>> partitions-count
>>>>
>>>>>                 value=4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be
>>>>>>>>>>>>>>
>>>>>>>>>>>>> very
>>>>
>>>>> long.
>>>>>>>>>>>
>>>>>>>>>>>> I think the current max size is 210 characters (or maybe
>>>>>>>>>>>>>>
>>>>>>>>>>>>> 240-ish?).
>>>>>>
>>>>>>> Having the "uncleanable-partitions" being a list could be very
>>>>>>>>>>>>>>
>>>>>>>>>>>>> large
>>>>>>
>>>>>>> metric. Also, having the metric come out as a csv might be
>>>>>>>>>>>>>>
>>>>>>>>>>>>> difficult
>>>>>>
>>>>>>> to
>>>>>>>>>>>
>>>>>>>>>>>> work with for monitoring systems. If we *did* want the topic
>>>>>>>>>>>>>>
>>>>>>>>>>>>> names
>>>>
>>>>> to
>>>>>>>>>
>>>>>>>>>> be
>>>>>>>>>>>>
>>>>>>>>>>>>> accessible, what do you think of having the
>>>>>>>>>>>>>>         kafka.log:type=LogCleanerManag
>>>>>>>>>>>>>> er,topic=topic1,partition=2
>>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
>>>>>>>>>>>>>>
>>>>>>>>>>>>> example
>>>>>>
>>>>>>> was
>>>>>>>>>>>
>>>>>>>>>>>> that the topic and partition can be tags in the metric. That
>>>>>>>>>>>>>>
>>>>>>>>>>>>> will
>>>>
>>>>> allow
>>>>>>>>>>>
>>>>>>>>>>>> monitoring systems to more easily slice and dice the metric. I'm
>>>>>>>>>>>>>>
>>>>>>>>>>>>> not
>>>>>>
>>>>>>> sure what the attribute for that metric would be. Maybe
>>>>>>>>>>>>>>
>>>>>>>>>>>>> something
>>>>
>>>>> like
>>>>>>>>>>>
>>>>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
>>>>>>>>>>>>>>
>>>>>>>>>>>>> time-since-last-clean?
>>>>>>>>>
>>>>>>>>>> Or
>>>>>>>>>>>>
>>>>>>>>>>>>> maybe even just "Value=1".
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I haven't though about this that hard, but do we really need
>>>>>>>>>>>>> the
>>>>>>>>>>>>> uncleanable topic names to be accessible through a metric?  It
>>>>>>>>>>>>>
>>>>>>>>>>>> seems
>>>>>>
>>>>>>> like
>>>>>>>>>>>
>>>>>>>>>>>> the admin should notice that uncleanable partitions are present,
>>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>
>>>>>>> then
>>>>>>>>>>>
>>>>>>>>>>>> check the logs?
>>>>>>>>>>>>>
>>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
>>>>>>>>>>>>>> indicates that the disk is having problems. I'm not sure that
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>
>>>>>>>>>>>>> the
>>>>>>
>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
>>>>>>>>>>>>>>
>>>>>>>>>>>>> problems,
>>>>>>
>>>>>>> all
>>>>>>>>>>>>
>>>>>>>>>>>>> of them are partition-level scenarios that happened during
>>>>>>>>>>>>>>
>>>>>>>>>>>>> normal
>>>>
>>>>> operation. None of them were indicative of disk problems.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think this is a meaningful comparison.  In general, we
>>>>>>>>>>>>>
>>>>>>>>>>>> don't
>>>>>>
>>>>>>> accept JIRAs for hard disk problems that happen on a particular
>>>>>>>>>>>>>
>>>>>>>>>>>> cluster.
>>>>>>>>>>>
>>>>>>>>>>>> If someone opened a JIRA that said "my hard disk is having
>>>>>>>>>>>>>
>>>>>>>>>>>> problems"
>>>>>>
>>>>>>> we
>>>>>>>>>
>>>>>>>>>> could close that as "not a Kafka bug."  This doesn't prove that
>>>>>>>>>>>>>
>>>>>>>>>>>> disk
>>>>>>
>>>>>>> problems don't happen, but  just that JIRA isn't the right place
>>>>>>>>>>>>>
>>>>>>>>>>>> for
>>>>>>
>>>>>>> them.
>>>>>>>>>>>>
>>>>>>>>>>>>> I do agree that the log cleaner has had a significant number of
>>>>>>>>>>>>>
>>>>>>>>>>>> logic
>>>>>>
>>>>>>> bugs, and that we need to be careful to limit their impact.
>>>>>>>>>>>>>
>>>>>>>>>>>> That's
>>>>
>>>>> one
>>>>>>>>>
>>>>>>>>>> reason why I think that a threshold of "number of uncleanable
>>>>>>>>>>>>>
>>>>>>>>>>>> logs"
>>>>
>>>>> is
>>>>>>>>>
>>>>>>>>>> a
>>>>>>>>>>>
>>>>>>>>>>>> good idea, rather than just failing after one IOException.  In
>>>>>>>>>>>>>
>>>>>>>>>>>> all
>>>>
>>>>> the
>>>>>>>>>
>>>>>>>>>> cases I've seen where a user hit a logic bug in the log cleaner,
>>>>>>>>>>>>>
>>>>>>>>>>>> it
>>>>
>>>>> was
>>>>>>>>>
>>>>>>>>>> just one partition that had the issue.  We also should increase
>>>>>>>>>>>>>
>>>>>>>>>>>> test
>>>>>>
>>>>>>> coverage for the log cleaner.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * About marking disks as offline when exceeding a certain
>>>>>>>>>>>>>>
>>>>>>>>>>>>> threshold,
>>>>>>
>>>>>>> that actually increases the blast radius of log compaction
>>>>>>>>>>>>>>
>>>>>>>>>>>>> failures.
>>>>>>
>>>>>>> Currently, the uncleaned partitions are still readable and
>>>>>>>>>>>>>>
>>>>>>>>>>>>> writable.
>>>>>>
>>>>>>> Taking the disks offline would impact availability of the
>>>>>>>>>>>>>>
>>>>>>>>>>>>> uncleanable
>>>>>>>>>
>>>>>>>>>> partitions, as well as impact all other partitions that are on
>>>>>>>>>>>>>>
>>>>>>>>>>>>> the
>>>>
>>>>> disk.
>>>>>>>>>>>>
>>>>>>>>>>>>> In general, when we encounter I/O errors, we take the disk
>>>>>>>>>>>>>
>>>>>>>>>>>> partition
>>>>>>
>>>>>>> offline.  This is spelled out in KIP-112 (
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%
>>>> 3A+Handle+disk+failure+for+JBOD
>>>>
>>>>> ) :
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Broker assumes a log directory to be good after it starts,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>
>>>>>>>>>>>>> mark
>>>>>>>>>
>>>>>>>>>> log directory as
>>>>>>>>>>>>>
>>>>>>>>>>>>>> bad once there is IOException when broker attempts to access
>>>>>>>>>>>>>>
>>>>>>>>>>>>> (i.e.
>>>>
>>>>> read
>>>>>>>>>>>
>>>>>>>>>>>> or write) the log directory.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log directory.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> New
>>>
>>>

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ray Chiang <rc...@apache.org>.
I'm okay with that.

-Ray

On 8/6/18 10:59 AM, Colin McCabe wrote:
> Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it seemed to be necessary?  What do you think?
>
> regards,
> Colin
>
>
> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
>> Hey Ray,
>>
>> Thanks for the explanation. In regards to the configuration property - I'm
>> not sure. As long as it has sufficient documentation, I find
>> "max.uncleanable.partitions" to be okay. If we were to add the distinction
>> explicitly, maybe it should be `max.uncleanable.partitions.per.logdir` ?
>>
>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org> wrote:
>>
>>> One more thing occurred to me.  Should the configuration property be
>>> named "max.uncleanable.partitions.per.disk" instead?
>>>
>>> -Ray
>>>
>>>
>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
>>>> Yes, good catch. Thank you, James!
>>>>
>>>> Best,
>>>> Stanislav
>>>>
>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com> wrote:
>>>>
>>>>> Can you update the KIP to say what the default is for
>>>>> max.uncleanable.partitions?
>>>>>
>>>>> -James
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanislav@confluent.io>
>>>>>> wrote:
>>>>>> Hey group,
>>>>>>
>>>>>> I am planning on starting a voting thread tomorrow. Please do reply
>>>>>> if you feel there is anything left to discuss.
>>>>>>
>>>>>> Best,
>>>>>> Stanislav
>>>>>>
>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanislav@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey, Ray
>>>>>>>
>>>>>>> Thanks for pointing that out, it's fixed now
>>>>>>>
>>>>>>> Best,
>>>>>>> Stanislav
>>>>>>>
>>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org> wrote:
>>>>>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
>>>>>>>> the main KIP landing page
>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>>>>>>>> I tried, but the Wiki won't let me.
>>>>>>>>
>>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>>>>>>>>> Hey guys,
>>>>>>>>>
>>>>>>>>> @Colin - good point. I added some sentences mentioning recent
>>>>>>>>> improvements in the introductory section.
>>>>>>>>>
>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
>>>>>>>>> fails, you don't want to work with it again. As such, I've changed
>>>>>>>>> my mind and believe that we should mark the LogDir (assume it's a
>>>>>>>>> disk) as offline on the first `IOException` encountered. This is
>>>>>>>>> the LogCleaner's current behavior. We shouldn't change that.
>>>>>>>>>
>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a thread.
>>>>>>>>> The correct approach in my mind is to either have it stay dead or
>>>>>>>>> never let it die in the first place.
>>>>>>>>>
>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric
>>>>>>>>> is unneeded. Users can monitor the `uncleanable-partitions-count`
>>>>>>>>> metric and inspect logs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hey Ray,
>>>>>>>>>
>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner
>>>>>>>>>> to skip over problematic partitions instead of dying.
>>>>>>>>> I think we can do this for every exception that isn't `IOException`.
>>>>>>>>> This will future-proof us against bugs in the system and potential
>>>>>>>>> other errors. Protecting yourself against unexpected failures is
>>>>>>>>> always a good thing in my mind, but I also think that protecting
>>>>>>>>> yourself against bugs in the software is sort of clunky. What does
>>>>>>>>> everybody think about this?
>>>>>>>>>
>>>>>>>>>> 4) The only improvement I can think of is that if such an
>>>>>>>>>> error occurs, then have the option (configuration setting?) to
>>>>>>>>>> create a <log_segment>.skip file (or something similar).
>>>>>>>>> This is a good suggestion. Have others also seen corruption be
>>>>>>>>> generally tied to the same segment?
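Ray's `<log_segment>.skip` idea could work roughly as follows: on a repeated compaction failure, drop an empty marker file next to the offending segment and have later cleaner passes skip it. A minimal Python sketch; the marker name, helper functions, and flow are all assumptions for illustration, not anything that exists in Kafka:

```python
import os
import tempfile

# Hypothetical "<log_segment>.skip" marker scheme: an empty file next to
# a segment signals that the cleaner should not attempt it again.

def mark_segment_skipped(segment_path):
    # Create an empty marker file beside the failed segment.
    with open(segment_path + ".skip", "w"):
        pass

def should_skip_segment(segment_path):
    return os.path.exists(segment_path + ".skip")

with tempfile.TemporaryDirectory() as d:
    seg = os.path.join(d, "00000000000000000000.log")
    print(should_skip_segment(seg))  # False: no marker yet
    mark_segment_skipped(seg)
    print(should_skip_segment(seg))  # True: marker present
```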
>>>>>>>>>
>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io>
>>>>>>>>> wrote:
>>>>>>>>>> For the cleaner thread specifically, I do not think respawning
>>>>>>>>>> will help at all because we are more than likely to run into the
>>>>>>>>>> same issue again which would end up crashing the cleaner. Retrying
>>>>>>>>>> makes sense for transient errors or when you believe some part of
>>>>>>>>>> the system could have healed itself, both of which I think are not
>>>>>>>>>> true for the log cleaner.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> <<<respawning threads is likely to make things worse, by putting
>>>>>>>>>>> you in an infinite loop which consumes resources and fires off
>>>>>>>>>>> continuous log messages.
>>>>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
>>>>>>>>>>> effect is to implement a backoff mechanism (if a second respawn
>>>>>>>>>>> is to occur then wait for 1 minute before doing it; then if a
>>>>>>>>>>> third respawn is to occur wait for 2 minutes before doing it;
>>>>>>>>>>> then 4 minutes, 8 minutes, etc. up to some max wait time).
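The doubling backoff described above (1 minute before the second respawn, then 2, 4, 8, etc. up to a cap) can be sketched as follows; the function name and the 30-minute cap are illustrative assumptions, not part of any proposal:

```python
# Doubling backoff for thread respawns: the first respawn is immediate,
# the second waits 1 minute, then 2, 4, 8, ... capped at max_wait_minutes.

def respawn_delay_minutes(respawn_number, max_wait_minutes=30):
    """Minutes to wait before the Nth respawn of the thread."""
    if respawn_number <= 1:
        return 0
    delay = 2 ** (respawn_number - 2)  # 1, 2, 4, ... from the 2nd respawn on
    return min(delay, max_wait_minutes)

print([respawn_delay_minutes(n) for n in range(1, 8)])
# [0, 1, 2, 4, 8, 16, 30]
```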
>>>>>>>>>>>
>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not in
>>>>>>>>>>> this context, but a mitigation like the increasing backoff
>>>>>>>>>>> described above may be relevant in weighing the pros and cons.
>>>>>>>>>>>
>>>>>>>>>>> Ron
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more
>>>>>>>>>>>>> tolerant of errors. Currently, as you said, once it dies, it
>>>>>>>>>>>>> stays dead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Things are better now than they used to be. We have the metric
>>>>>>>>>>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>>>>>>>>>>> which we can use to tell us if the threads are dead. And as of
>>>>>>>>>>>>> 1.1.0, we have KIP-226, which allows you to restart the log
>>>>>>>>>>>>> cleaner thread, without requiring a broker restart.
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>>>>>>>> I've only read about this, I haven't personally tried it.
>>>>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
>>>>>>>>>>>> probably add a sentence or two mentioning the KIP-226 changes
>>>>>>>>>>>> somewhere in the KIP.  Maybe in the intro section?
>>>>>>>>>>>>
>>>>>>>>>>>> I think it's clear that requiring the users to manually restart
>>>>>>>>>>>> the log cleaner is not a very good solution.  But it's good to
>>>>>>>>>>>> know that it's a possibility on some older releases.
>>>>>>>>>>>>
>>>>>>>>>>>>> Some comments:
>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to clean
>>>>>>>>>>>>> as many partitions as it can, skipping over the problematic
>>>>>>>>>>>>> ones if possible.
>>>>>>>>>>>>> * If the log cleaner thread dies, I think it should
>>>>>>>>>>>>> automatically be revived. Your KIP attempts to do that by
>>>>>>>>>>>>> catching exceptions during execution, but I think we should go
>>>>>>>>>>>>> all the way and make sure that a new one gets created, if the
>>>>>>>>>>>>> thread ever dies.
>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We
>>>>>>>>>>>> don't automatically re-create other threads in the broker if
>>>>>>>>>>>> they terminate.  In general, if there is a serious bug in the
>>>>>>>>>>>> code, respawning threads is likely to make things worse, by
>>>>>>>>>>>> putting you in an infinite loop which consumes resources and
>>>>>>>>>>>> fires off continuous log messages.
>>>>>>>>>>>>
>>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable
>>>>>>>>>>>>> partitions.  I've seen cases where an uncleanable partition
>>>>>>>>>>>>> later became cleanable.  I unfortunately don't remember how
>>>>>>>>>>>>> that happened, but I remember being surprised when I discovered
>>>>>>>>>>>>> it. It might have been something like a follower was
>>>>>>>>>>>>> uncleanable but after a leader election happened, the log
>>>>>>>>>>>>> truncated and it was then cleanable again. I'm not sure.
>>>>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop
>>>>>>>>>>>> Distributed File System (HDFS) and it was a constant source of
>>>>>>>>>>>> user problems.
>>>>>>>>>>>>
>>>>>>>>>>>> What would happen is disks would just go bad over time.  The
>>>>>>>>>>>> DataNode would notice this and take them offline.  But then, due
>>>>>>>>>>>> to some "optimistic" code, the DataNode would periodically try
>>>>>>>>>>>> to re-add them to the system.  Then one of two things would
>>>>>>>>>>>> happen: the disk would just fail immediately again, or it would
>>>>>>>>>>>> appear to work and then fail after a short amount of time.
>>>>>>>>>>>>
>>>>>>>>>>>> The way the disk failed was normally having an I/O request take
>>>>>>>>>>>> a really long time and time out.  So a bunch of request handler
>>>>>>>>>>>> threads would basically slam into a brick wall when they tried
>>>>>>>>>>>> to access the bad disk, slowing the DataNode to a crawl.  It was
>>>>>>>>>>>> even worse in the second scenario, if the disk appeared to work
>>>>>>>>>>>> for a while, but then failed.  Any data that had been written on
>>>>>>>>>>>> that DataNode to that disk would be lost, and we would need to
>>>>>>>>>>>> re-replicate it.
>>>>>>>>>>>>
>>>>>>>>>>>> Disks aren't biological systems-- they don't heal over time.
>>>>>>>>>>>> Once they're bad, they stay bad.  The log cleaner needs to be
>>>>>>>>>>>> robust against cases where the disk really is failing, and
>>>>>>>>>>>> really is returning bad data or timing out.
>>>>>>>>>>>>
>>>>>>>>>>>>> * For your metrics, can you spell out the full metric in
>>>>>>>>>>>>> JMX-style format, such as:
>>>>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>>>>>>>>>>>>>                 value=4
>>>>>>>>>>>>>
>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be
>>>>>>>>>>>>> very long.  I think the current max size is 210 characters (or
>>>>>>>>>>>>> maybe 240-ish?).  Having the "uncleanable-partitions" metric be
>>>>>>>>>>>>> a list could make for a very large metric. Also, having the
>>>>>>>>>>>>> metric come out as a csv might be difficult to work with for
>>>>>>>>>>>>> monitoring systems. If we *did* want the topic names to be
>>>>>>>>>>>>> accessible, what do you think of having the
>>>>>>>>>>>>>         kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
>>>>>>>>>>>>> example was that the topic and partition can be tags in the
>>>>>>>>>>>>> metric. That will allow monitoring systems to more easily slice
>>>>>>>>>>>>> and dice the metric. I'm not sure what the attribute for that
>>>>>>>>>>>>> metric would be. Maybe something like "uncleaned bytes" for
>>>>>>>>>>>>> that topic-partition? Or time-since-last-clean? Or maybe even
>>>>>>>>>>>>> just "Value=1".
>>>>>>>>>>>> I haven't thought about this that hard, but do we really need
>>>>>>>>>>>> the uncleanable topic names to be accessible through a metric?
>>>>>>>>>>>> It seems like the admin should notice that uncleanable
>>>>>>>>>>>> partitions are present, and then check the logs?
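To make the two metric shapes under discussion concrete, here is how the resulting JMX-style object names would read: a single aggregate gauge versus per-partition tags. The strings are assembled by hand for illustration; the actual type and attribute names are still being debated above:

```python
def jmx_name(domain, **tags):
    """Build a JMX ObjectName-style string from a domain and key=value tags."""
    props = ",".join(f"{k}={v}" for k, v in tags.items())
    return f"{domain}:{props}"

# Aggregate count metric (the shape the KIP proposes):
count_metric = jmx_name("kafka.log", type="LogCleanerManager",
                        name="uncleanable-partitions-count")

# Per-partition tagged variant (the alternative suggested above):
tagged_metric = jmx_name("kafka.log", type="LogCleanerManager",
                         topic="topic1", partition=2)

print(count_metric)   # kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
print(tagged_metric)  # kafka.log:type=LogCleanerManager,topic=topic1,partition=2
```

The tagged form keeps each metric name short and lets monitoring systems group by topic, at the cost of one JMX object per uncleanable partition.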
>>>>>>>>>>>>
>>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
>>>>>>>>>>>>> indicates that the disk is having problems. I'm not sure that
>>>>>>>>>>>>> is the case. For the 4 JIRAs that you mentioned about log
>>>>>>>>>>>>> cleaner problems, all of them are partition-level scenarios
>>>>>>>>>>>>> that happened during normal operation. None of them were
>>>>>>>>>>>>> indicative of disk problems.
>>>>>>>>>>>> I don't think this is a meaningful comparison.  In general, we
>>>>>>>>>>>> don't accept JIRAs for hard disk problems that happen on a
>>>>>>>>>>>> particular cluster.  If someone opened a JIRA that said "my hard
>>>>>>>>>>>> disk is having problems" we could close that as "not a Kafka
>>>>>>>>>>>> bug."  This doesn't prove that disk problems don't happen, but
>>>>>>>>>>>> just that JIRA isn't the right place for them.
>>>>>>>>>>>>
>>>>>>>>>>>> I do agree that the log cleaner has had a significant number of
>>>>>>>>>>>> logic bugs, and that we need to be careful to limit their
>>>>>>>>>>>> impact.  That's one reason why I think that a threshold of
>>>>>>>>>>>> "number of uncleanable logs" is a good idea, rather than just
>>>>>>>>>>>> failing after one IOException.  In all the cases I've seen where
>>>>>>>>>>>> a user hit a logic bug in the log cleaner, it was just one
>>>>>>>>>>>> partition that had the issue.  We also should increase test
>>>>>>>>>>>> coverage for the log cleaner.
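The threshold approach described above, tolerating individual uncleanable partitions but treating a log directory as bad once too many of its partitions have failed, can be sketched like this. All class and parameter names are invented for illustration; Kafka's actual LogCleaner is Scala and structured differently:

```python
# Per-partition failure tracking with a per-logdir threshold, in the
# spirit of the proposed max.uncleanable.partitions config.

class CleanerFailureTracker:
    def __init__(self, max_uncleanable_partitions):
        self.max_uncleanable = max_uncleanable_partitions
        self.uncleanable = {}     # log dir -> set of failed partitions
        self.offline_dirs = set()

    def record_failure(self, log_dir, partition):
        failed = self.uncleanable.setdefault(log_dir, set())
        failed.add(partition)
        # Too many failed partitions on one dir: treat the whole dir as bad.
        if len(failed) > self.max_uncleanable:
            self.offline_dirs.add(log_dir)

    def is_cleanable(self, log_dir, partition):
        return (log_dir not in self.offline_dirs
                and partition not in self.uncleanable.get(log_dir, set()))

tracker = CleanerFailureTracker(max_uncleanable_partitions=2)
tracker.record_failure("/data0", "topic1-0")
tracker.record_failure("/data0", "topic1-1")
print(tracker.is_cleanable("/data0", "topic2-0"))  # True: under threshold
tracker.record_failure("/data0", "topic1-2")       # third failure
print(tracker.is_cleanable("/data0", "topic2-0"))  # False: dir marked offline
```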
>>>>>>>>>>>>
>>>>>>>>>>>>> * About marking disks as offline when exceeding a certain
>>>>>>>>>>>>> threshold, that actually increases the blast radius of log
>>>>>>>>>>>>> compaction failures.  Currently, the uncleaned partitions are
>>>>>>>>>>>>> still readable and writable.  Taking the disks offline would
>>>>>>>>>>>>> impact availability of the uncleanable partitions, as well as
>>>>>>>>>>>>> impact all other partitions that are on the disk.
>>>>>>>>>>>> In general, when we encounter I/O errors, we take the disk
>>>>>>>>>>>> partition offline.  This is spelled out in KIP-112 (
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>>>>>>>>>>>> ) :
>>>>>>>>>>>>
>>>>>>>>>>>>> - Broker assumes a log directory to be good after it starts,
>>>>>>>>>>>>> and mark log directory as bad once there is IOException when
>>>>>>>>>>>>> broker attempts to access (i.e. read or write) the log
>>>>>>>>>>>>> directory.
>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log directory.
>>>>>>>>>>>>> New replicas will only be created on good log directory.
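The KIP-112 rules quoted above amount to a small state machine: every log directory starts good, the first IOException marks it bad for the broker's lifetime, and the broker stays online while at least one good directory remains. A rough Python sketch under those assumptions, not Kafka's actual implementation:

```python
# Log-directory good/bad state per the KIP-112 rules quoted above.

class LogDirStates:
    def __init__(self, log_dirs):
        self.log_dirs = set(log_dirs)
        self.bad = set()

    def on_io_exception(self, log_dir):
        self.bad.add(log_dir)      # bad dirs never return to service

    def serves_replicas_in(self, log_dir):
        return log_dir not in self.bad

    def broker_online(self):
        # Broker stays up while any log directory is still good.
        return bool(self.log_dirs - self.bad)

state = LogDirStates(["/data0", "/data1"])
state.on_io_exception("/data0")
print(state.serves_replicas_in("/data0"), state.broker_online())  # False True
state.on_io_exception("/data1")
print(state.broker_online())  # False: all log directories are bad
```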
>>>>>>>>>>>> The behavior Stanislav is proposing for the log cleaner is
>>>>>>>>>>>> actually more optimistic than what we do for regular broker I/O,
>>>>>>>>>>>> since we will tolerate multiple IOExceptions, not just one.  But
>>>>>>>>>>>> it's generally consistent.  Ignoring errors is not.  In any
>>>>>>>>>>>> case, if you want to tolerate an unlimited number of I/O errors,
>>>>>>>>>>>> you can just set the threshold to an infinite value (although I
>>>>>>>>>>>> think that would be a bad idea).
>>>>>>>>>>>>
>>>>>>>>>>>> best,
>>>>>>>>>>>> Colin
>>>>>>>>>>>>
>>>>>>>>>>>>> -James
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
>>>>>>>>>>>>>> stanislav@confluent.io> wrote:
>>>>>>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that.
>>>>>>>>>>>>>> Here is the new link:
>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
>>>>>>>>>>>>>> stanislav@confluent.io> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey group,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I created a new KIP about making log compaction more
>>>>>>>>>>>>>>> fault-tolerant.  Please give it a look here and please share
>>>>>>>>>>>>>>> what you think, especially in regards to the points in the
>>>>>>>>>>>>>>> "Needs Discussion" paragraph.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> KIP: KIP-346
>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Stanislav
>>>>>>> --
>>>>>>> Best,
>>>>>>> Stanislav
>>>>>>>
>>>>>> --
>>>>>> Best,
>>>>>> Stanislav
>>>
>> -- 
>> Best,
>> Stanislav


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Colin McCabe <cm...@apache.org>.
Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it seemed to be necessary?  What do you think?

regards,
Colin


On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> Hey Ray,
> 
> Thanks for the explanation. In regards to the configuration property - I'm
> not sure. As long as it has sufficient documentation, I find
> "max.uncleanable.partitions" to be okay. If we were to add the distinction
> explicitly, maybe it should be `max.uncleanable.partitions.per.logdir` ?
> 
> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org> wrote:
> 
> > One more thing occurred to me.  Should the configuration property be
> > named "max.uncleanable.partitions.per.disk" instead?
> >
> > -Ray
> >
> >
> > On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > > Yes, good catch. Thank you, James!
> > >
> > > Best,
> > > Stanislav
> > >
> > > On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com> wrote:
> > >
> > >> Can you update the KIP to say what the default is for
> > >> max.uncleanable.partitions?
> > >>
> > >> -James
> > >>
> > >> Sent from my iPhone
> > >>
> > >>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <
> > stanislav@confluent.io>
> > >> wrote:
> > >>> Hey group,
> > >>>
> > >>> I am planning on starting a voting thread tomorrow. Please do reply if
> > >> you
> > >>> feel there is anything left to discuss.
> > >>>
> > >>> Best,
> > >>> Stanislav
> > >>>
> > >>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> > >> stanislav@confluent.io>
> > >>> wrote:
> > >>>
> > >>>> Hey, Ray
> > >>>>
> > >>>> Thanks for pointing that out, it's fixed now
> > >>>>
> > >>>> Best,
> > >>>> Stanislav
> > >>>>
> > >>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org>
> > wrote:
> > >>>>>
> > >>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
> > >>>>> the main KIP landing page
> > >>>>> <
> > >>>>>
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
> > >>> ?
> > >>>>> I tried, but the Wiki won't let me.
> > >>>>>
> > >>>>> -Ray
> > >>>>>
> > >>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > >>>>>> Hey guys,
> > >>>>>>
> > >>>>>> @Colin - good point. I added some sentences mentioning recent
> > >>>>> improvements
> > >>>>>> in the introductory section.
> > >>>>>>
> > >>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
> > >>>>> fails,
> > >>>>>> you don't want to work with it again. As such, I've changed my mind
> > >> and
> > >>>>>> believe that we should mark the LogDir (assume it's a disk) as
> > offline
> > >> on
> > >>>>>> the first `IOException` encountered. This is the LogCleaner's
> > current
> > >>>>>> behavior. We shouldn't change that.
> > >>>>>>
> > >>>>>> *Respawning Threads* - I believe we should never re-spawn a thread.
> > >> The
> > >>>>>> correct approach in my mind is to either have it stay dead or never
> > >> let
> > >>>>> it
> > >>>>>> die in the first place.
> > >>>>>>
> > >>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric
> > is
> > >>>>>> unneeded. Users can monitor the `uncleanable-partitions-count`
> > metric
> > >>>>> and
> > >>>>>> inspect logs.
> > >>>>>>
> > >>>>>>
> > >>>>>> Hey Ray,
> > >>>>>>
> > >>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner
> > to
> > >>>>>>> skip over problematic partitions instead of dying.
> > >>>>>> I think we can do this for every exception that isn't `IOException`.
> > >>>>> This
> > >>>>>> will future-proof us against bugs in the system and potential other
> > >>>>> errors.
> > >>>>>> Protecting yourself against unexpected failures is always a good
> > thing
> > >>>>> in
> > >>>>>> my mind, but I also think that protecting yourself against bugs in
> > the
> > >>>>>> software is sort of clunky. What does everybody think about this?
> > >>>>>>
> > >>>>>>> 4) The only improvement I can think of is that if such an
> > >>>>>>> error occurs, then have the option (configuration setting?) to
> > >> create a
> > >>>>>>> <log_segment>.skip file (or something similar).
> > >>>>>> This is a good suggestion. Have others also seen corruption be
> > >> generally
> > >>>>>> tied to the same segment?
> > >>>>>>
> > >>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io
> > >
> > >>>>> wrote:
> > >>>>>>> For the cleaner thread specifically, I do not think respawning will
> > >>>>> help at
> > >>>>>>> all because we are more than likely to run into the same issue
> > again
> > >>>>> which
> > >>>>>>> would end up crashing the cleaner. Retrying makes sense for
> > transient
> > >>>>>>> errors or when you believe some part of the system could have
> > healed
> > >>>>>>> itself, both of which I think are not true for the log cleaner.
> > >>>>>>>
> > >>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>> <<<respawning threads is likely to make things worse, by putting
> > you
> > >>>>> in
> > >>>>>>> an
> > >>>>>>>> infinite loop which consumes resources and fires off continuous
> > log
> > >>>>>>>> messages.
> > >>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
> > >>>>> effect
> > >>>>>>> is
> > >>>>>>>> to implement a backoff mechanism (if a second respawn is to occur
> > >> then
> > >>>>>>> wait
> > >>>>>>>> for 1 minute before doing it; then if a third respawn is to occur
> > >> wait
> > >>>>>>> for
> > >>>>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
> > >> some
> > >>>>> max
> > >>>>>>>> wait time).
> > >>>>>>>>
> > >>>>>>>> I have no opinion on whether respawn is appropriate or not in this
> > >>>>>>> context,
> > >>>>>>>> but a mitigation like the increasing backoff described above may
> > be
> > >>>>>>>> relevant in weighing the pros and cons.
> > >>>>>>>>
> > >>>>>>>> Ron
> > >>>>>>>>
> > >>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
> > >>>>> wrote:
> > >>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > >>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> > >>>>>>>>>>
> > >>>>>>>>>> I agree that it would be good if the LogCleaner were more
> > tolerant
> > >>>>> of
> > >>>>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
> > >>>>>>>>>>
> > >>>>>>>>>> Things are better now than they used to be. We have the metric
> > >>>>>>>>>>
> > kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > >>>>>>>>>> which we can use to tell us if the threads are dead. And as of
> > >>>>> 1.1.0,
> > >>>>>>>> we
> > >>>>>>>>>> have KIP-226, which allows you to restart the log cleaner
> > thread,
> > >>>>>>>>>> without requiring a broker restart.
> > >>>>>>>>>>
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > >>>>>>>>>> <
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > >>>>>>>>>> I've only read about this, I haven't personally tried it.
> > >>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
> > probably
> > >>>>>>> add a
> > >>>>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the
> > >> KIP.
> > >>>>>>>> Maybe
> > >>>>>>>>> in the intro section?
> > >>>>>>>>>
> > >>>>>>>>> I think it's clear that requiring the users to manually restart
> > the
> > >>>>> log
> > >>>>>>>>> cleaner is not a very good solution.  But it's good to know that
> > >>>>> it's a
> > >>>>>>>>> possibility on some older releases.
> > >>>>>>>>>
> > >>>>>>>>>> Some comments:
> > >>>>>>>>>> * I like the idea of having the log cleaner continue to clean as
> > >>>>> many
> > >>>>>>>>>> partitions as it can, skipping over the problematic ones if
> > >>>>> possible.
> > >>>>>>>>>> * If the log cleaner thread dies, I think it should
> > automatically
> > >> be
> > >>>>>>>>>> revived. Your KIP attempts to do that by catching exceptions
> > >> during
> > >>>>>>>>>> execution, but I think we should go all the way and make sure
> > >> that a
> > >>>>>>>> new
> > >>>>>>>>>> one gets created, if the thread ever dies.
> > >>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We
> > >> don't
> > >>>>>>>>> automatically re-create other threads in the broker if they
> > >>>>> terminate.
> > >>>>>>>> In
> > >>>>>>>>> general, if there is a serious bug in the code, respawning
> > threads
> > >> is
> > >>>>>>>>> likely to make things worse, by putting you in an infinite loop
> > >> which
> > >>>>>>>>> consumes resources and fires off continuous log messages.
> > >>>>>>>>>
> > >>>>>>>>>> * It might be worth trying to re-clean the uncleanable
> > partitions.
> > >>>>>>> I've
> > >>>>>>>>>> seen cases where an uncleanable partition later became
> > cleanable.
> > >> I
> > >>>>>>>>>> unfortunately don't remember how that happened, but I remember
> > >> being
> > >>>>>>>>>> surprised when I discovered it. It might have been something
> > like
> > >> a
> > >>>>>>>>>> follower was uncleanable but after a leader election happened,
> > the
> > >>>>>>> log
> > >>>>>>>>>> truncated and it was then cleanable again. I'm not sure.
> > >>>>>>>>> James, I disagree.  We had this behavior in the Hadoop
> > Distributed
> > >>>>> File
> > >>>>>>>>> System (HDFS) and it was a constant source of user problems.
> > >>>>>>>>>
> > >>>>>>>>> What would happen is disks would just go bad over time.  The
> > >> DataNode
> > >>>>>>>>> would notice this and take them offline.  But then, due to some
> > >>>>>>>>> "optimistic" code, the DataNode would periodically try to re-add
> > >> them
> > >>>>>>> to
> > >>>>>>>>> the system.  Then one of two things would happen: the disk would
> > >> just
> > >>>>>>>> fail
> > >>>>>>>>> immediately again, or it would appear to work and then fail
> > after a
> > >>>>>>> short
> > >>>>>>>>> amount of time.
> > >>>>>>>>>
> > >>>>>>>>> The way the disk failed was normally having an I/O request take a
> > >>>>>>> really
> > >>>>>>>>> long time and time out.  So a bunch of request handler threads
> > >> would
> > >>>>>>>>> basically slam into a brick wall when they tried to access the
> > bad
> > >>>>>>> disk,
> > >>>>>>>>> slowing the DataNode to a crawl.  It was even worse in the second
> > >>>>>>>> scenario,
> > >>>>>>>>> if the disk appeared to work for a while, but then failed.  Any
> > >> data
> > >>>>>>> that
> > >>>>>>>>> had been written on that DataNode to that disk would be lost, and
> > >> we
> > >>>>>>>> would
> > >>>>>>>>> need to re-replicate it.
> > >>>>>>>>>
> > >>>>>>>>> Disks aren't biological systems-- they don't heal over time.
> > Once
> > >>>>>>>> they're
> > >>>>>>>>> bad, they stay bad.  The log cleaner needs to be robust against
> > >> cases
> > >>>>>>>> where
> > >>>>>>>>> the disk really is failing, and really is returning bad data or
> > >>>>> timing
> > >>>>>>>> out.
> > >>>>>>>>>> * For your metrics, can you spell out the full metric in
> > JMX-style
> > >>>>>>>>>> format, such as:
> > >>>>>>>>>>
> > >>>>>>>>
> >  kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> > >>>>>>>>>>                value=4
> > >>>>>>>>>>
> > >>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be
> > very
> > >>>>>>> long.
> > >>>>>>>>>> I think the current max size is 210 characters (or maybe
> > >> 240-ish?).
> > >>>>>>>>>> Having the "uncleanable-partitions" being a list could be very
> > >> large
> > >>>>>>>>>> metric. Also, having the metric come out as a csv might be
> > >> difficult
> > >>>>>>> to
> > >>>>>>>>>> work with for monitoring systems. If we *did* want the topic
> > names
> > >>>>> to
> > >>>>>>>> be
> > >>>>>>>>>> accessible, what do you think of having the
> > >>>>>>>>>>        kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > >>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> > >> example
> > >>>>>>> was
> > >>>>>>>>>> that the topic and partition can be tags in the metric. That
> > will
> > >>>>>>> allow
> > >>>>>>>>>> monitoring systems to more easily slice and dice the metric. I'm
> > >> not
> > >>>>>>>>>> sure what the attribute for that metric would be. Maybe
> > something
> > >>>>>>> like
> > >>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
> > >>>>> time-since-last-clean?
> > >>>>>>>> Or
> > >>>>>>>>>> maybe even just "Value=1".
> > >>>>>>>>> I haven't thought about this that hard, but do we really need the
> > >>>>>>>>> uncleanable topic names to be accessible through a metric?  It
> > >> seems
> > >>>>>>> like
> > >>>>>>>>> the admin should notice that uncleanable partitions are present,
> > >> and
> > >>>>>>> then
> > >>>>>>>>> check the logs?
> > >>>>>>>>>
> > >>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
> > >>>>>>>>>> indicates that the disk is having problems. I'm not sure that is
> > >> the
> > >>>>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> > >> problems,
> > >>>>>>>> all
> > >>>>>>>>>> of them are partition-level scenarios that happened during
> > normal
> > >>>>>>>>>> operation. None of them were indicative of disk problems.
> > >>>>>>>>> I don't think this is a meaningful comparison.  In general, we
> > >> don't
> > >>>>>>>>> accept JIRAs for hard disk problems that happen on a particular
> > >>>>>>> cluster.
> > >>>>>>>>> If someone opened a JIRA that said "my hard disk is having
> > >> problems"
> > >>>>> we
> > >>>>>>>>> could close that as "not a Kafka bug."  This doesn't prove that
> > >> disk
> > >>>>>>>>> problems don't happen, but  just that JIRA isn't the right place
> > >> for
> > >>>>>>>> them.
> > >>>>>>>>> I do agree that the log cleaner has had a significant number of
> > >> logic
> > >>>>>>>>> bugs, and that we need to be careful to limit their impact.
> > That's
> > >>>>> one
> > >>>>>>>>> reason why I think that a threshold of "number of uncleanable
> > logs"
> > >>>>> is
> > >>>>>>> a
> > >>>>>>>>> good idea, rather than just failing after one IOException.  In
> > all
> > >>>>> the
> > >>>>>>>>> cases I've seen where a user hit a logic bug in the log cleaner,
> > it
> > >>>>> was
> > >>>>>>>>> just one partition that had the issue.  We also should increase
> > >> test
> > >>>>>>>>> coverage for the log cleaner.
> > >>>>>>>>>
> > >>>>>>>>>> * About marking disks as offline when exceeding a certain
> > >> threshold,
> > >>>>>>>>>> that actually increases the blast radius of log compaction
> > >> failures.
> > >>>>>>>>>> Currently, the uncleaned partitions are still readable and
> > >> writable.
> > >>>>>>>>>> Taking the disks offline would impact availability of the
> > >>>>> uncleanable
> > >>>>>>>>>> partitions, as well as impact all other partitions that are on
> > the
> > >>>>>>>> disk.
> > >>>>>>>>> In general, when we encounter I/O errors, we take the disk
> > >> partition
> > >>>>>>>>> offline.  This is spelled out in KIP-112 (
> > >>>>>>>>>
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> > >>>>>>>>> ) :
> > >>>>>>>>>
> > >>>>>>>>>> - Broker assumes a log directory to be good after it starts, and
> > >>>>> mark
> > >>>>>>>>> log directory as
> > >>>>>>>>>> bad once there is IOException when broker attempts to access
> > (i.e.
> > >>>>>>> read
> > >>>>>>>>> or write) the log directory.
> > >>>>>>>>>> - Broker will be offline if all log directories are bad.
> > >>>>>>>>>> - Broker will stop serving replicas in any bad log directory.
> > New
> > >>>>>>>>> replicas will only be created
> > >>>>>>>>>> on good log directory.
> > >>>>>>>>> The behavior Stanislav is proposing for the log cleaner is
> > actually
> > >>>>>>> more
> > >>>>>>>>> optimistic than what we do for regular broker I/O, since we will
> > >>>>>>> tolerate
> > >>>>>>>>> multiple IOExceptions, not just one.  But it's generally
> > >> consistent.
> > >>>>>>>>> Ignoring errors is not.  In any case, if you want to tolerate an
> > >>>>>>>> unlimited
> > >>>>>>>>> number of I/O errors, you can just set the threshold to an
> > infinite
> > >>>>>>> value
> > >>>>>>>>> (although I think that would be a bad idea).
> > >>>>>>>>>
> > >>>>>>>>> best,
> > >>>>>>>>> Colin
> > >>>>>>>>>
> > >>>>>>>>>> -James
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
> > >>>>>>>>> stanislav@confluent.io> wrote:
> > >>>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that.
> > >> Here
> > >>>>>>>> is
> > >>>>>>>>> the
> > >>>>>>>>>>> new link:
> > >>>>>>>>>>>
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> > >>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
> > >>>>>>>>> stanislav@confluent.io>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hey group,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I created a new KIP about making log compaction more
> > >>>>>>> fault-tolerant.
> > >>>>>>>>>>>> Please give it a look here and please share what you think,
> > >>>>>>>>> especially in
> > >>>>>>>>>>>> regards to the points in the "Needs Discussion" paragraph.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> KIP: KIP-346
> > >>>>>>>>>>>> <
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Stanislav
> > >>>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Stanislav
> > >>>>>
> > >>>> --
> > >>>> Best,
> > >>>> Stanislav
> > >>>>
> > >>>
> > >>> --
> > >>> Best,
> > >>> Stanislav
> > >
> >
> >
> 
> -- 
> Best,
> Stanislav
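
The increasing backoff Ron Dagostino suggests earlier in the thread (1 minute before a second respawn, then 2, 4, 8 minutes, up to some maximum) could be sketched as follows. The function name and the `base`/`max_wait` values are assumptions for illustration, not proposed configuration:

```python
# Illustrative capped exponential backoff for thread respawns,
# per Ron's suggestion: no wait before the first spawn, 1 minute
# before the second, doubling each time up to max_wait.
def respawn_backoff_seconds(attempt, base=60, max_wait=1800):
    """Seconds to wait before respawn number `attempt` (1-indexed)."""
    if attempt <= 1:
        return 0
    return min(base * 2 ** (attempt - 2), max_wait)
```

Whether respawning is appropriate at all remains the open question above; the sketch only shows how the mitigation would bound the resource cost of a respawn loop.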

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey Ray,

Thanks for the explanation. In regards to the configuration property - I'm
not sure. As long as it has sufficient documentation, I find
"max.uncleanable.partitions" to be okay. If we were to add the distinction
explicitly, maybe it should be `max.uncleanable.partitions.per.logdir` ?

On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rc...@apache.org> wrote:

> One more thing occurred to me.  Should the configuration property be
> named "max.uncleanable.partitions.per.disk" instead?
>
> -Ray
>
>
> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > Yes, good catch. Thank you, James!
> >
> > Best,
> > Stanislav
> >
> > On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com> wrote:
> >
> >> Can you update the KIP to say what the default is for
> >> max.uncleanable.partitions?
> >>
> >> -James
> >>
> >> Sent from my iPhone
> >>
> >>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <
> stanislav@confluent.io>
> >> wrote:
> >>> Hey group,
> >>>
> >>> I am planning on starting a voting thread tomorrow. Please do reply if
> >> you
> >>> feel there is anything left to discuss.
> >>>
> >>> Best,
> >>> Stanislav
> >>>
> >>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> >> stanislav@confluent.io>
> >>> wrote:
> >>>
> >>>> Hey, Ray
> >>>>
> >>>> Thanks for pointing that out, it's fixed now
> >>>>
> >>>> Best,
> >>>> Stanislav
> >>>>
> >>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org>
> wrote:
> >>>>>
> >>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
> >>>>> the main KIP landing page
> >>>>> <
> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
> >>> ?
> >>>>> I tried, but the Wiki won't let me.
> >>>>>
> >>>>> -Ray
> >>>>>
> >>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>>>> Hey guys,
> >>>>>>
> >>>>>> @Colin - good point. I added some sentences mentioning recent
> >>>>> improvements
> >>>>>> in the introductory section.
> >>>>>>
> >>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
> >>>>> fails,
> >>>>>> you don't want to work with it again. As such, I've changed my mind
> >> and
> >>>>>> believe that we should mark the LogDir (assume it's a disk) as
> offline
> >> on
> >>>>>> the first `IOException` encountered. This is the LogCleaner's
> current
> >>>>>> behavior. We shouldn't change that.
> >>>>>>
> >>>>>> *Respawning Threads* - I believe we should never re-spawn a thread.
> >> The
> >>>>>> correct approach in my mind is to either have it stay dead or never
> >> let
> >>>>> it
> >>>>>> die in the first place.
> >>>>>>
> >>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric
> is
> >>>>>> unneeded. Users can monitor the `uncleanable-partitions-count`
> metric
> >>>>> and
> >>>>>> inspect logs.
> >>>>>>
> >>>>>>
> >>>>>> Hey Ray,
> >>>>>>
> >>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner
> to
> >>>>>>> skip over problematic partitions instead of dying.
> >>>>>> I think we can do this for every exception that isn't `IOException`.
> >>>>> This
> >>>>>> will future-proof us against bugs in the system and potential other
> >>>>> errors.
> >>>>>> Protecting yourself against unexpected failures is always a good
> thing
> >>>>> in
> >>>>>> my mind, but I also think that protecting yourself against bugs in
> the
> >>>>>> software is sort of clunky. What does everybody think about this?
> >>>>>>
> >>>>>>> 4) The only improvement I can think of is that if such an
> >>>>>>> error occurs, then have the option (configuration setting?) to
> >> create a
> >>>>>>> <log_segment>.skip file (or something similar).
> >>>>>> This is a good suggestion. Have others also seen corruption be
> >> generally
> >>>>>> tied to the same segment?
> >>>>>>
> >>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io
> >
> >>>>> wrote:
> >>>>>>> For the cleaner thread specifically, I do not think respawning will
> >>>>> help at
> >>>>>>> all because we are more than likely to run into the same issue
> again
> >>>>> which
> >>>>>>> would end up crashing the cleaner. Retrying makes sense for
> transient
> >>>>>>> errors or when you believe some part of the system could have
> healed
> >>>>>>> itself, both of which I think are not true for the log cleaner.
> >>>>>>>
> >>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
> >>>>> wrote:
> >>>>>>>> <<<respawning threads is likely to make things worse, by putting
> you
> >>>>> in
> >>>>>>> an
> >>>>>>>> infinite loop which consumes resources and fires off continuous
> log
> >>>>>>>> messages.
> >>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
> >>>>> effect
> >>>>>>> is
> >>>>>>>> to implement a backoff mechanism (if a second respawn is to occur
> >> then
> >>>>>>> wait
> >>>>>>>> for 1 minute before doing it; then if a third respawn is to occur
> >> wait
> >>>>>>> for
> >>>>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
> >> some
> >>>>> max
> >>>>>>>> wait time).
> >>>>>>>>
> >>>>>>>> I have no opinion on whether respawn is appropriate or not in this
> >>>>>>> context,
> >>>>>>>> but a mitigation like the increasing backoff described above may
> be
> >>>>>>>> relevant in weighing the pros and cons.
> >>>>>>>>
> >>>>>>>> Ron
> >>>>>>>>
> >>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
> >>>>> wrote:
> >>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>>>
> >>>>>>>>>> I agree that it would be good if the LogCleaner were more
> tolerant
> >>>>> of
> >>>>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
> >>>>>>>>>>
> >>>>>>>>>> Things are better now than they used to be. We have the metric
> >>>>>>>>>>
> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>>>> which we can use to tell us if the threads are dead. And as of
> >>>>> 1.1.0,
> >>>>>>>> we
> >>>>>>>>>> have KIP-226, which allows you to restart the log cleaner
> thread,
> >>>>>>>>>> without requiring a broker restart.
> >>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>>>> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>>>> I've only read about this, I haven't personally tried it.
> >>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should
> probably
> >>>>>>> add a
> >>>>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the
> >> KIP.
> >>>>>>>> Maybe
> >>>>>>>>> in the intro section?
> >>>>>>>>>
> >>>>>>>>> I think it's clear that requiring the users to manually restart
> the
> >>>>> log
> >>>>>>>>> cleaner is not a very good solution.  But it's good to know that
> >>>>> it's a
> >>>>>>>>> possibility on some older releases.
> >>>>>>>>>
> >>>>>>>>>> Some comments:
> >>>>>>>>>> * I like the idea of having the log cleaner continue to clean as
> >>>>> many
> >>>>>>>>>> partitions as it can, skipping over the problematic ones if
> >>>>> possible.
> >>>>>>>>>> * If the log cleaner thread dies, I think it should
> automatically
> >> be
> >>>>>>>>>> revived. Your KIP attempts to do that by catching exceptions
> >> during
> >>>>>>>>>> execution, but I think we should go all the way and make sure
> >> that a
> >>>>>>>> new
> >>>>>>>>>> one gets created, if the thread ever dies.
> >>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We
> >> don't
> >>>>>>>>> automatically re-create other threads in the broker if they
> >>>>> terminate.
> >>>>>>>> In
> >>>>>>>>> general, if there is a serious bug in the code, respawning
> threads
> >> is
> >>>>>>>>> likely to make things worse, by putting you in an infinite loop
> >> which
> >>>>>>>>> consumes resources and fires off continuous log messages.
> >>>>>>>>>
> >>>>>>>>>> * It might be worth trying to re-clean the uncleanable
> partitions.
> >>>>>>> I've
> >>>>>>>>>> seen cases where an uncleanable partition later became
> cleanable.
> >> I
> >>>>>>>>>> unfortunately don't remember how that happened, but I remember
> >> being
> >>>>>>>>>> surprised when I discovered it. It might have been something
> like
> >> a
> >>>>>>>>>> follower was uncleanable but after a leader election happened,
> the
> >>>>>>> log
> >>>>>>>>>> truncated and it was then cleanable again. I'm not sure.
> >>>>>>>>> James, I disagree.  We had this behavior in the Hadoop
> Distributed
> >>>>> File
> >>>>>>>>> System (HDFS) and it was a constant source of user problems.
> >>>>>>>>>
> >>>>>>>>> What would happen is disks would just go bad over time.  The
> >> DataNode
> >>>>>>>>> would notice this and take them offline.  But then, due to some
> >>>>>>>>> "optimistic" code, the DataNode would periodically try to re-add
> >> them
> >>>>>>> to
> >>>>>>>>> the system.  Then one of two things would happen: the disk would
> >> just
> >>>>>>>> fail
> >>>>>>>>> immediately again, or it would appear to work and then fail
> after a
> >>>>>>> short
> >>>>>>>>> amount of time.
> >>>>>>>>>
> >>>>>>>>> The way the disk failed was normally having an I/O request take a
> >>>>>>> really
> >>>>>>>>> long time and time out.  So a bunch of request handler threads
> >> would
> >>>>>>>>> basically slam into a brick wall when they tried to access the
> bad
> >>>>>>> disk,
> >>>>>>>>> slowing the DataNode to a crawl.  It was even worse in the second
> >>>>>>>> scenario,
> >>>>>>>>> if the disk appeared to work for a while, but then failed.  Any
> >> data
> >>>>>>> that
> >>>>>>>>> had been written on that DataNode to that disk would be lost, and
> >> we
> >>>>>>>> would
> >>>>>>>>> need to re-replicate it.
> >>>>>>>>>
> >>>>>>>>> Disks aren't biological systems-- they don't heal over time.
> Once
> >>>>>>>> they're
> >>>>>>>>> bad, they stay bad.  The log cleaner needs to be robust against
> >> cases
> >>>>>>>> where
> >>>>>>>>> the disk really is failing, and really is returning bad data or
> >>>>> timing
> >>>>>>>> out.
> >>>>>>>>>> * For your metrics, can you spell out the full metric in
> JMX-style
> >>>>>>>>>> format, such as:
> >>>>>>>>>>
> >>>>>>>>
>  kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >>>>>>>>>>                value=4
> >>>>>>>>>>
> >>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be
> very
> >>>>>>> long.
> >>>>>>>>>> I think the current max size is 210 characters (or maybe
> >> 240-ish?).
> >>>>>>>>>> Having the "uncleanable-partitions" being a list could be very
> >> large
> >>>>>>>>>> metric. Also, having the metric come out as a csv might be
> >> difficult
> >>>>>>> to
> >>>>>>>>>> work with for monitoring systems. If we *did* want the topic
> names
> >>>>> to
> >>>>>>>> be
> >>>>>>>>>> accessible, what do you think of having the
> >>>>>>>>>>        kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> >> example
> >>>>>>> was
> >>>>>>>>>> that the topic and partition can be tags in the metric. That
> will
> >>>>>>> allow
> >>>>>>>>>> monitoring systems to more easily slice and dice the metric. I'm
> >> not
> >>>>>>>>>> sure what the attribute for that metric would be. Maybe
> something
> >>>>>>> like
> >>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
> >>>>> time-since-last-clean?
> >>>>>>>> Or
> >>>>>>>>>> maybe even just "Value=1".
> >>>>>>>>> I haven't thought about this that hard, but do we really need the
> >>>>>>>>> uncleanable topic names to be accessible through a metric?  It
> >> seems
> >>>>>>> like
> >>>>>>>>> the admin should notice that uncleanable partitions are present,
> >> and
> >>>>>>> then
> >>>>>>>>> check the logs?
> >>>>>>>>>
> >>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
> >>>>>>>>>> indicates that the disk is having problems. I'm not sure that is
> >> the
> >>>>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> >> problems,
> >>>>>>>> all
> >>>>>>>>>> of them are partition-level scenarios that happened during
> normal
> >>>>>>>>>> operation. None of them were indicative of disk problems.
> >>>>>>>>> I don't think this is a meaningful comparison.  In general, we
> >> don't
> >>>>>>>>> accept JIRAs for hard disk problems that happen on a particular
> >>>>>>> cluster.
> >>>>>>>>> If someone opened a JIRA that said "my hard disk is having
> >> problems"
> >>>>> we
> >>>>>>>>> could close that as "not a Kafka bug."  This doesn't prove that
> >> disk
> >>>>>>>>> problems don't happen, but  just that JIRA isn't the right place
> >> for
> >>>>>>>> them.
> >>>>>>>>> I do agree that the log cleaner has had a significant number of
> >> logic
> >>>>>>>>> bugs, and that we need to be careful to limit their impact.
> That's
> >>>>> one
> >>>>>>>>> reason why I think that a threshold of "number of uncleanable
> logs"
> >>>>> is
> >>>>>>> a
> >>>>>>>>> good idea, rather than just failing after one IOException.  In
> all
> >>>>> the
> >>>>>>>>> cases I've seen where a user hit a logic bug in the log cleaner,
> it
> >>>>> was
> >>>>>>>>> just one partition that had the issue.  We also should increase
> >> test
> >>>>>>>>> coverage for the log cleaner.
> >>>>>>>>>
> >>>>>>>>>> * About marking disks as offline when exceeding a certain
> >> threshold,
> >>>>>>>>>> that actually increases the blast radius of log compaction
> >> failures.
> >>>>>>>>>> Currently, the uncleaned partitions are still readable and
> >> writable.
> >>>>>>>>>> Taking the disks offline would impact availability of the
> >>>>> uncleanable
> >>>>>>>>>> partitions, as well as impact all other partitions that are on
> the
> >>>>>>>> disk.
> >>>>>>>>> In general, when we encounter I/O errors, we take the disk
> >> partition
> >>>>>>>>> offline.  This is spelled out in KIP-112 (
> >>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> >>>>>>>>> ) :
> >>>>>>>>>
> >>>>>>>>>> - Broker assumes a log directory to be good after it starts, and
> >>>>> mark
> >>>>>>>>> log directory as
> >>>>>>>>>> bad once there is IOException when broker attempts to access
> (i.e.
> >>>>>>> read
> >>>>>>>>> or write) the log directory.
> >>>>>>>>>> - Broker will be offline if all log directories are bad.
> >>>>>>>>>> - Broker will stop serving replicas in any bad log directory.
> New
> >>>>>>>>> replicas will only be created
> >>>>>>>>>> on good log directory.
> >>>>>>>>> The behavior Stanislav is proposing for the log cleaner is
> actually
> >>>>>>> more
> >>>>>>>>> optimistic than what we do for regular broker I/O, since we will
> >>>>>>> tolerate
> >>>>>>>>> multiple IOExceptions, not just one.  But it's generally
> >> consistent.
> >>>>>>>>> Ignoring errors is not.  In any case, if you want to tolerate an
> >>>>>>>> unlimited
> >>>>>>>>> number of I/O errors, you can just set the threshold to an
> infinite
> >>>>>>> value
> >>>>>>>>> (although I think that would be a bad idea).
> >>>>>>>>>
> >>>>>>>>> best,
> >>>>>>>>> Colin
> >>>>>>>>>
> >>>>>>>>>> -James
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
> >>>>>>>>> stanislav@confluent.io> wrote:
> >>>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that.
> >> Here
> >>>>>>>> is
> >>>>>>>>> the
> >>>>>>>>>>> new link:
> >>>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
> >>>>>>>>> stanislav@confluent.io>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hey group,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I created a new KIP about making log compaction more
> >>>>>>> fault-tolerant.
> >>>>>>>>>>>> Please give it a look here and please share what you think,
> >>>>>>>>> especially in
> >>>>>>>>>>>> regards to the points in the "Needs Discussion" paragraph.
> >>>>>>>>>>>>
> >>>>>>>>>>>> KIP: KIP-346
> >>>>>>>>>>>> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Stanislav
> >>>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Stanislav
> >>>>>
> >>>> --
> >>>> Best,
> >>>> Stanislav
> >>>>
> >>>
> >>> --
> >>> Best,
> >>> Stanislav
> >
>
>

-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ray Chiang <rc...@apache.org>.
One more thing occurred to me.  Should the configuration property be 
named "max.uncleanable.partitions.per.disk" instead?

-Ray


On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> Yes, good catch. Thank you, James!
>
> Best,
> Stanislav
>
> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com> wrote:
>
>> Can you update the KIP to say what the default is for
>> max.uncleanable.partitions?
>>
>> -James
>>
>> Sent from my iPhone
>>
>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <st...@confluent.io>
>> wrote:
>>> Hey group,
>>>
>>> I am planning on starting a voting thread tomorrow. Please do reply if
>> you
>>> feel there is anything left to discuss.
>>>
>>> Best,
>>> Stanislav
>>>
>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
>> stanislav@confluent.io>
>>> wrote:
>>>
>>>> Hey, Ray
>>>>
>>>> Thanks for pointing that out, it's fixed now
>>>>
>>>> Best,
>>>> Stanislav
>>>>
>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org> wrote:
>>>>>
>>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
>>>>> the main KIP landing page
>>>>> <
>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
>>> ?
>>>>> I tried, but the Wiki won't let me.
>>>>>
>>>>> -Ray
>>>>>
>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>>>>>> Hey guys,
>>>>>>
>>>>>> @Colin - good point. I added some sentences mentioning recent
>>>>> improvements
>>>>>> in the introductory section.
>>>>>>
>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
>>>>> fails,
>>>>>> you don't want to work with it again. As such, I've changed my mind
>> and
> >>>>>> believe that we should mark the LogDir (assume it's a disk) as offline
>> on
>>>>>> the first `IOException` encountered. This is the LogCleaner's current
>>>>>> behavior. We shouldn't change that.
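
A rough sketch of the combined policy being discussed (the class, method, and config names are illustrative assumptions, not Kafka's actual LogCleanerManager code): an IOException takes the log directory offline immediately, while other cleaning errors only mark the partition uncleanable and count toward a max.uncleanable.partitions threshold.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only -- names and exact semantics are assumptions,
// not Kafka's real implementation.
public final class CleanerErrorPolicy {
    private final int maxUncleanablePartitions; // e.g. max.uncleanable.partitions
    private final Set<String> uncleanable = new HashSet<>();

    public CleanerErrorPolicy(int maxUncleanablePartitions) {
        this.maxUncleanablePartitions = maxUncleanablePartitions;
    }

    /**
     * Record a cleaning failure for a partition.
     * Returns true if the whole log directory should be marked offline.
     */
    public boolean onCleanFailure(String topicPartition, Exception cause) {
        if (cause instanceof IOException) {
            return true; // disk-level problem: offline on the first IOException
        }
        uncleanable.add(topicPartition); // skip this partition, keep cleaning others
        return uncleanable.size() > maxUncleanablePartitions;
    }

    public int uncleanableCount() {
        return uncleanable.size();
    }
}
```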
>>>>>>
>>>>>> *Respawning Threads* - I believe we should never re-spawn a thread.
>> The
>>>>>> correct approach in my mind is to either have it stay dead or never
>> let
>>>>> it
>>>>>> die in the first place.
>>>>>>
>>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
>>>>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
>>>>> and
>>>>>> inspect logs.
>>>>>>
>>>>>>
>>>>>> Hey Ray,
>>>>>>
>>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
>>>>>>> skip over problematic partitions instead of dying.
>>>>>> I think we can do this for every exception that isn't `IOException`.
>>>>> This
>>>>>> will future-proof us against bugs in the system and potential other
>>>>> errors.
>>>>>> Protecting yourself against unexpected failures is always a good thing
>>>>> in
>>>>>> my mind, but I also think that protecting yourself against bugs in the
>>>>>> software is sort of clunky. What does everybody think about this?
>>>>>>
>>>>>>> 4) The only improvement I can think of is that if such an
>>>>>>> error occurs, then have the option (configuration setting?) to
>> create a
>>>>>>> <log_segment>.skip file (or something similar).
>>>>>> This is a good suggestion. Have others also seen corruption be
>> generally
>>>>>> tied to the same segment?
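
If Ray's marker-file idea were pursued, it could be as simple as a sibling file next to the segment. This is purely a sketch of that suggestion, not something the KIP proposes; all names are illustrative.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the "<log_segment>.skip" marker idea; names are illustrative.
public final class SegmentSkipMarker {

    /** e.g. 00000000000000000000.log -> 00000000000000000000.log.skip */
    public static Path markerFor(Path segmentFile) {
        return segmentFile.resolveSibling(segmentFile.getFileName() + ".skip");
    }

    /** The cleaner would consult this before attempting the segment again. */
    public static boolean shouldSkip(Path segmentFile) {
        return Files.exists(markerFor(segmentFile));
    }

    /** Called after a cleaning error deemed tied to this segment. */
    public static void markSkipped(Path segmentFile) throws java.io.IOException {
        Files.createFile(markerFor(segmentFile));
    }
}
```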
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io>
>>>>> wrote:
>>>>>>> For the cleaner thread specifically, I do not think respawning will
>>>>> help at
>>>>>>> all because we are more than likely to run into the same issue again
>>>>> which
>>>>>>> would end up crashing the cleaner. Retrying makes sense for transient
>>>>>>> errors or when you believe some part of the system could have healed
>>>>>>> itself, both of which I think are not true for the log cleaner.
>>>>>>>
>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
>>>>> wrote:
>>>>>>>> <<<respawning threads is likely to make things worse, by putting you
>>>>> in
>>>>>>> an
>>>>>>>> infinite loop which consumes resources and fires off continuous log
>>>>>>>> messages.
>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
>>>>> effect
>>>>>>> is
>>>>>>>> to implement a backoff mechanism (if a second respawn is to occur
>> then
>>>>>>> wait
>>>>>>>> for 1 minute before doing it; then if a third respawn is to occur
>> wait
>>>>>>> for
>>>>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
>> some
>>>>> max
>>>>>>>> wait time).
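
The capped exponential backoff Ron describes could be sketched like this (the constants, cap, and names are illustrative assumptions):

```java
// Illustrative sketch of a capped exponential respawn backoff:
// 1, 2, 4, 8 ... minutes, up to a maximum wait.
public final class RespawnBackoff {
    private static final long BASE_MS = 60_000L;       // first wait: 1 minute
    private static final long MAX_MS = 30 * 60_000L;   // cap: 30 minutes (assumed)

    /** Wait before the n-th respawn attempt (n starts at 1). */
    public static long delayMs(int attempt) {
        // Cap the shift so the multiplication can't overflow a long.
        long delay = BASE_MS << Math.min(attempt - 1, 20);
        return Math.min(delay, MAX_MS);
    }
}
```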
>>>>>>>>
>>>>>>>> I have no opinion on whether respawn is appropriate or not in this
>>>>>>> context,
>>>>>>>> but a mitigation like the increasing backoff described above may be
>>>>>>>> relevant in weighing the pros and cons.
>>>>>>>>
>>>>>>>> Ron
>>>>>>>>
>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
>>>>> wrote:
>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>>>>>>
>>>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant
>>>>> of
>>>>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
>>>>>>>>>>
>>>>>>>>>> Things are better now than they used to be. We have the metric
>>>>>>>>>>        kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>>>>>>>> which we can use to tell us if the threads are dead. And as of
>>>>> 1.1.0,
>>>>>>>> we
>>>>>>>>>> have KIP-226, which allows you to restart the log cleaner thread,
>>>>>>>>>> without requiring a broker restart.
>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>>>>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>>>>> I've only read about this, I haven't personally tried it.
>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should probably
>>>>>>> add a
>>>>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the
>> KIP.
>>>>>>>> Maybe
>>>>>>>>> in the intro section?
>>>>>>>>>
>>>>>>>>> I think it's clear that requiring the users to manually restart the
>>>>> log
>>>>>>>>> cleaner is not a very good solution.  But it's good to know that
>>>>> it's a
>>>>>>>>> possibility on some older releases.
>>>>>>>>>
>>>>>>>>>> Some comments:
>>>>>>>>>> * I like the idea of having the log cleaner continue to clean as
>>>>> many
>>>>>>>>>> partitions as it can, skipping over the problematic ones if
>>>>> possible.
>>>>>>>>>> * If the log cleaner thread dies, I think it should automatically
>> be
>>>>>>>>>> revived. Your KIP attempts to do that by catching exceptions
>> during
>>>>>>>>>> execution, but I think we should go all the way and make sure
>> that a
>>>>>>>> new
>>>>>>>>>> one gets created, if the thread ever dies.
>>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We
>> don't
>>>>>>>>> automatically re-create other threads in the broker if they
>>>>> terminate.
>>>>>>>> In
>>>>>>>>> general, if there is a serious bug in the code, respawning threads
>> is
>>>>>>>>> likely to make things worse, by putting you in an infinite loop
>> which
>>>>>>>>> consumes resources and fires off continuous log messages.
>>>>>>>>>
>>>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions.
>>>>>>> I've
>>>>>>>>>> seen cases where an uncleanable partition later became cleanable.
>> I
>>>>>>>>>> unfortunately don't remember how that happened, but I remember
>> being
>>>>>>>>>> surprised when I discovered it. It might have been something like
>> a
>>>>>>>>>> follower was uncleanable but after a leader election happened, the
>>>>>>> log
>>>>>>>>>> truncated and it was then cleanable again. I'm not sure.
>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed
>>>>> File
>>>>>>>>> System (HDFS) and it was a constant source of user problems.
>>>>>>>>>
>>>>>>>>> What would happen is disks would just go bad over time.  The
>> DataNode
>>>>>>>>> would notice this and take them offline.  But then, due to some
>>>>>>>>> "optimistic" code, the DataNode would periodically try to re-add
>> them
>>>>>>> to
>>>>>>>>> the system.  Then one of two things would happen: the disk would
>> just
>>>>>>>> fail
>>>>>>>>> immediately again, or it would appear to work and then fail after a
>>>>>>> short
>>>>>>>>> amount of time.
>>>>>>>>>
>>>>>>>>> The way the disk failed was normally having an I/O request take a
>>>>>>> really
>>>>>>>>> long time and time out.  So a bunch of request handler threads
>> would
>>>>>>>>> basically slam into a brick wall when they tried to access the bad
>>>>>>> disk,
>>>>>>>>> slowing the DataNode to a crawl.  It was even worse in the second
>>>>>>>> scenario,
>>>>>>>>> if the disk appeared to work for a while, but then failed.  Any
>> data
>>>>>>> that
>>>>>>>>> had been written on that DataNode to that disk would be lost, and
>> we
>>>>>>>> would
>>>>>>>>> need to re-replicate it.
>>>>>>>>>
>>>>>>>>> Disks aren't biological systems-- they don't heal over time.  Once
>>>>>>>> they're
>>>>>>>>> bad, they stay bad.  The log cleaner needs to be robust against
>> cases
>>>>>>>> where
>>>>>>>>> the disk really is failing, and really is returning bad data or
>>>>> timing
>>>>>>>> out.
>>>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style
>>>>>>>>>> format, such as:
>>>>>>>>>>
>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>>>>>>>>>>                value=4
>>>>>>>>>>
>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very
>>>>>>> long.
>>>>>>>>>> I think the current max size is 210 characters (or maybe
>> 240-ish?).
>>>>>>>>>> Having the "uncleanable-partitions" being a list could be very
>> large
>>>>>>>>>> metric. Also, having the metric come out as a csv might be
>> difficult
>>>>>>> to
>>>>>>>>>> work with for monitoring systems. If we *did* want the topic names
>>>>> to
>>>>>>>> be
>>>>>>>>>> accessible, what do you think of having the
>>>>>>>>>>        kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
>> example
>>>>>>> was
>>>>>>>>>> that the topic and partition can be tags in the metric. That will
>>>>>>> allow
>>>>>>>>>> monitoring systems to more easily slice and dice the metric. I'm
>> not
>>>>>>>>>> sure what the attribute for that metric would be. Maybe something
>>>>>>> like
>>>>>>>>>> "uncleaned bytes" for that topic-partition? Or
>>>>> time-since-last-clean?
>>>>>>>> Or
>>>>>>>>>> maybe even just "Value=1".
> >>>>>>>>> I haven't thought about this that hard, but do we really need the
>>>>>>>>> uncleanable topic names to be accessible through a metric?  It
>> seems
>>>>>>> like
>>>>>>>>> the admin should notice that uncleanable partitions are present,
>> and
>>>>>>> then
>>>>>>>>> check the logs?
>>>>>>>>>
>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
>>>>>>>>>> indicates that the disk is having problems. I'm not sure that is
>> the
>>>>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
>> problems,
>>>>>>>> all
>>>>>>>>>> of them are partition-level scenarios that happened during normal
>>>>>>>>>> operation. None of them were indicative of disk problems.
>>>>>>>>> I don't think this is a meaningful comparison.  In general, we
>> don't
>>>>>>>>> accept JIRAs for hard disk problems that happen on a particular
>>>>>>> cluster.
>>>>>>>>> If someone opened a JIRA that said "my hard disk is having
>> problems"
>>>>> we
>>>>>>>>> could close that as "not a Kafka bug."  This doesn't prove that
>> disk
> >>>>>>>>> problems don't happen, but just that JIRA isn't the right place
>> for
>>>>>>>> them.
>>>>>>>>> I do agree that the log cleaner has had a significant number of
>> logic
>>>>>>>>> bugs, and that we need to be careful to limit their impact.  That's
>>>>> one
>>>>>>>>> reason why I think that a threshold of "number of uncleanable logs"
>>>>> is
>>>>>>> a
>>>>>>>>> good idea, rather than just failing after one IOException.  In all
>>>>> the
>>>>>>>>> cases I've seen where a user hit a logic bug in the log cleaner, it
>>>>> was
>>>>>>>>> just one partition that had the issue.  We also should increase
>> test
>>>>>>>>> coverage for the log cleaner.
>>>>>>>>>
>>>>>>>>>> * About marking disks as offline when exceeding a certain
>> threshold,
>>>>>>>>>> that actually increases the blast radius of log compaction
>> failures.
>>>>>>>>>> Currently, the uncleaned partitions are still readable and
>> writable.
>>>>>>>>>> Taking the disks offline would impact availability of the
>>>>> uncleanable
>>>>>>>>>> partitions, as well as impact all other partitions that are on the
>>>>>>>> disk.
>>>>>>>>> In general, when we encounter I/O errors, we take the disk
>> partition
>>>>>>>>> offline.  This is spelled out in KIP-112 (
>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>>>>>>>>> ) :
>>>>>>>>>
>>>>>>>>>> - Broker assumes a log directory to be good after it starts, and
>>>>> mark
>>>>>>>>> log directory as
>>>>>>>>>> bad once there is IOException when broker attempts to access (i.e.
>>>>>>> read
>>>>>>>>> or write) the log directory.
>>>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>>>> - Broker will stop serving replicas in any bad log directory. New
>>>>>>>>> replicas will only be created
>>>>>>>>>> on good log directory.
>>>>>>>>> The behavior Stanislav is proposing for the log cleaner is actually
>>>>>>> more
>>>>>>>>> optimistic than what we do for regular broker I/O, since we will
>>>>>>> tolerate
>>>>>>>>> multiple IOExceptions, not just one.  But it's generally
>> consistent.
>>>>>>>>> Ignoring errors is not.  In any case, if you want to tolerate an
>>>>>>>> unlimited
>>>>>>>>> number of I/O errors, you can just set the threshold to an infinite
>>>>>>> value
>>>>>>>>> (although I think that would be a bad idea).
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> Colin
>>>>>>>>>
>>>>>>>>>> -James
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
>>>>>>>>> stanislav@confluent.io> wrote:
>>>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that.
>> Here
>>>>>>>> is
>>>>>>>>> the
>>>>>>>>>>> new link:
>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
>>>>>>>>> stanislav@confluent.io>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey group,
>>>>>>>>>>>>
>>>>>>>>>>>> I created a new KIP about making log compaction more
>>>>>>> fault-tolerant.
>>>>>>>>>>>> Please give it a look here and please share what you think,
>>>>>>>>> especially in
>>>>>>>>>>>> regards to the points in the "Needs Discussion" paragraph.
>>>>>>>>>>>>
>>>>>>>>>>>> KIP: KIP-346
>>>>>>>>>>>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>>>>>>>>>>>> --
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best,
>>>>>>>>>>> Stanislav
>>>>>
>>>> --
>>>> Best,
>>>> Stanislav
>>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Yes, good catch. Thank you, James!

Best,
Stanislav

On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wu...@gmail.com> wrote:

> Can you update the KIP to say what the default is for
> max.uncleanable.partitions?
>
> -James
>
> Sent from my iPhone
>
> > On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <st...@confluent.io>
> wrote:
> >
> > Hey group,
> >
> > I am planning on starting a voting thread tomorrow. Please do reply if
> you
> > feel there is anything left to discuss.
> >
> > Best,
> > Stanislav
> >
> > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> stanislav@confluent.io>
> > wrote:
> >
> >> Hey, Ray
> >>
> >> Thanks for pointing that out, it's fixed now
> >>
> >> Best,
> >> Stanislav
> >>
> >>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org> wrote:
> >>>
> >>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
> >>> the main KIP landing page
> >>> <
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
> >?
> >>>
> >>> I tried, but the Wiki won't let me.
> >>>
> >>> -Ray
> >>>
> >>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>> Hey guys,
> >>>>
> >>>> @Colin - good point. I added some sentences mentioning recent
> >>> improvements
> >>>> in the introductory section.
> >>>>
> >>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
> >>> fails,
> >>>> you don't want to work with it again. As such, I've changed my mind
> and
> >>>> believe that we should mark the LogDir (assume it's a disk) as offline
> on
> >>>> the first `IOException` encountered. This is the LogCleaner's current
> >>>> behavior. We shouldn't change that.
> >>>>
> >>>> *Respawning Threads* - I believe we should never re-spawn a thread.
> The
> >>>> correct approach in my mind is to either have it stay dead or never
> let
> >>> it
> >>>> die in the first place.
> >>>>
> >>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
> >>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
> >>> and
> >>>> inspect logs.
> >>>>
> >>>>
> >>>> Hey Ray,
> >>>>
> >>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
> >>>>> skip over problematic partitions instead of dying.
> >>>> I think we can do this for every exception that isn't `IOException`.
> >>> This
> >>>> will future-proof us against bugs in the system and potential other
> >>> errors.
> >>>> Protecting yourself against unexpected failures is always a good thing
> >>> in
> >>>> my mind, but I also think that protecting yourself against bugs in the
> >>>> software is sort of clunky. What does everybody think about this?
> >>>>
> >>>>> 4) The only improvement I can think of is that if such an
> >>>>> error occurs, then have the option (configuration setting?) to
> create a
> >>>>> <log_segment>.skip file (or something similar).
> >>>> This is a good suggestion. Have others also seen corruption be
> generally
> >>>> tied to the same segment?
> >>>>
> >>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io>
> >>> wrote:
> >>>>
> >>>>> For the cleaner thread specifically, I do not think respawning will
> >>> help at
> >>>>> all because we are more than likely to run into the same issue again
> >>> which
> >>>>> would end up crashing the cleaner. Retrying makes sense for transient
> >>>>> errors or when you believe some part of the system could have healed
> >>>>> itself, both of which I think are not true for the log cleaner.
> >>>>>
> >>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> <<<respawning threads is likely to make things worse, by putting you
> >>> in
> >>>>> an
> >>>>>> infinite loop which consumes resources and fires off continuous log
> >>>>>> messages.
> >>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
> >>> effect
> >>>>> is
> >>>>>> to implement a backoff mechanism (if a second respawn is to occur
> then
> >>>>> wait
> >>>>>> for 1 minute before doing it; then if a third respawn is to occur
> wait
> >>>>> for
> >>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
> some
> >>> max
> >>>>>> wait time).
> >>>>>>
> >>>>>> I have no opinion on whether respawn is appropriate or not in this
> >>>>> context,
> >>>>>> but a mitigation like the increasing backoff described above may be
> >>>>>> relevant in weighing the pros and cons.
> >>>>>>
> >>>>>> Ron
> >>>>>>
> >>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
> >>> wrote:
> >>>>>>
> >>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>
> >>>>>>>> I agree that it would be good if the LogCleaner were more tolerant
> >>> of
> >>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
> >>>>>>>>
> >>>>>>>> Things are better now than they used to be. We have the metric
> >>>>>>>>       kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>> which we can use to tell us if the threads are dead. And as of
> >>> 1.1.0,
> >>>>>> we
> >>>>>>>> have KIP-226, which allows you to restart the log cleaner thread,
> >>>>>>>> without requiring a broker restart.
> >>>>>>>>
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>> <
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>
> >>>>>>>> I've only read about this, I haven't personally tried it.
> >>>>>>> Thanks for pointing this out, James!  Stanislav, we should probably
> >>>>> add a
> >>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the
> KIP.
> >>>>>> Maybe
> >>>>>>> in the intro section?
> >>>>>>>
> >>>>>>> I think it's clear that requiring the users to manually restart the
> >>> log
> >>>>>>> cleaner is not a very good solution.  But it's good to know that
> >>> it's a
> >>>>>>> possibility on some older releases.
> >>>>>>>
> >>>>>>>> Some comments:
> >>>>>>>> * I like the idea of having the log cleaner continue to clean as
> >>> many
> >>>>>>>> partitions as it can, skipping over the problematic ones if
> >>> possible.
> >>>>>>>>
> >>>>>>>> * If the log cleaner thread dies, I think it should automatically
> be
> >>>>>>>> revived. Your KIP attempts to do that by catching exceptions
> during
> >>>>>>>> execution, but I think we should go all the way and make sure
> that a
> >>>>>> new
> >>>>>>>> one gets created, if the thread ever dies.
> >>>>>>> This is inconsistent with the way the rest of Kafka works.  We
> don't
> >>>>>>> automatically re-create other threads in the broker if they
> >>> terminate.
> >>>>>> In
> >>>>>>> general, if there is a serious bug in the code, respawning threads
> is
> >>>>>>> likely to make things worse, by putting you in an infinite loop
> which
> >>>>>>> consumes resources and fires off continuous log messages.
> >>>>>>>
> >>>>>>>> * It might be worth trying to re-clean the uncleanable partitions.
> >>>>> I've
> >>>>>>>> seen cases where an uncleanable partition later became cleanable.
> I
> >>>>>>>> unfortunately don't remember how that happened, but I remember
> being
> >>>>>>>> surprised when I discovered it. It might have been something like
> a
> >>>>>>>> follower was uncleanable but after a leader election happened, the
> >>>>> log
> >>>>>>>> truncated and it was then cleanable again. I'm not sure.
> >>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed
> >>> File
> >>>>>>> System (HDFS) and it was a constant source of user problems.
> >>>>>>>
> >>>>>>> What would happen is disks would just go bad over time.  The
> DataNode
> >>>>>>> would notice this and take them offline.  But then, due to some
> >>>>>>> "optimistic" code, the DataNode would periodically try to re-add
> them
> >>>>> to
> >>>>>>> the system.  Then one of two things would happen: the disk would
> just
> >>>>>> fail
> >>>>>>> immediately again, or it would appear to work and then fail after a
> >>>>> short
> >>>>>>> amount of time.
> >>>>>>>
> >>>>>>> The way the disk failed was normally having an I/O request take a
> >>>>> really
> >>>>>>> long time and time out.  So a bunch of request handler threads
> would
> >>>>>>> basically slam into a brick wall when they tried to access the bad
> >>>>> disk,
> >>>>>>> slowing the DataNode to a crawl.  It was even worse in the second
> >>>>>> scenario,
> >>>>>>> if the disk appeared to work for a while, but then failed.  Any
> data
> >>>>> that
> >>>>>>> had been written on that DataNode to that disk would be lost, and
> we
> >>>>>> would
> >>>>>>> need to re-replicate it.
> >>>>>>>
> >>>>>>> Disks aren't biological systems-- they don't heal over time.  Once
> >>>>>> they're
> >>>>>>> bad, they stay bad.  The log cleaner needs to be robust against
> cases
> >>>>>> where
> >>>>>>> the disk really is failing, and really is returning bad data or
> >>> timing
> >>>>>> out.
> >>>>>>>> * For your metrics, can you spell out the full metric in JMX-style
> >>>>>>>> format, such as:
> >>>>>>>>
> >>>>>>  kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >>>>>>>>               value=4
> >>>>>>>>
> >>>>>>>> * For "uncleanable-partitions": topic-partition names can be very
> >>>>> long.
> >>>>>>>> I think the current max size is 210 characters (or maybe
> 240-ish?).
> >>>>>>>> Having the "uncleanable-partitions" being a list could be very
> large
> >>>>>>>> metric. Also, having the metric come out as a csv might be
> difficult
> >>>>> to
> >>>>>>>> work with for monitoring systems. If we *did* want the topic names
> >>> to
> >>>>>> be
> >>>>>>>> accessible, what do you think of having the
> >>>>>>>>       kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> example
> >>>>> was
> >>>>>>>> that the topic and partition can be tags in the metric. That will
> >>>>> allow
> >>>>>>>> monitoring systems to more easily slice and dice the metric. I'm
> not
> >>>>>>>> sure what the attribute for that metric would be. Maybe something
> >>>>> like
> >>>>>>>> "uncleaned bytes" for that topic-partition? Or
> >>> time-since-last-clean?
> >>>>>> Or
> >>>>>>>> maybe even just "Value=1".
> >>>>>>> I haven't thought about this that hard, but do we really need the
> >>>>>>> uncleanable topic names to be accessible through a metric?  It
> seems
> >>>>> like
> >>>>>>> the admin should notice that uncleanable partitions are present,
> and
> >>>>> then
> >>>>>>> check the logs?
> >>>>>>>
> >>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
> >>>>>>>> indicates that the disk is having problems. I'm not sure that is
> the
> >>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> problems,
> >>>>>> all
> >>>>>>>> of them are partition-level scenarios that happened during normal
> >>>>>>>> operation. None of them were indicative of disk problems.
> >>>>>>> I don't think this is a meaningful comparison.  In general, we
> don't
> >>>>>>> accept JIRAs for hard disk problems that happen on a particular
> >>>>> cluster.
> >>>>>>> If someone opened a JIRA that said "my hard disk is having
> problems"
> >>> we
> >>>>>>> could close that as "not a Kafka bug."  This doesn't prove that
> disk
> >>>>>>> problems don't happen, but just that JIRA isn't the right place
> for
> >>>>>> them.
> >>>>>>> I do agree that the log cleaner has had a significant number of
> logic
> >>>>>>> bugs, and that we need to be careful to limit their impact.  That's
> >>> one
> >>>>>>> reason why I think that a threshold of "number of uncleanable logs"
> >>> is
> >>>>> a
> >>>>>>> good idea, rather than just failing after one IOException.  In all
> >>> the
> >>>>>>> cases I've seen where a user hit a logic bug in the log cleaner, it
> >>> was
> >>>>>>> just one partition that had the issue.  We also should increase
> test
> >>>>>>> coverage for the log cleaner.
> >>>>>>>
> >>>>>>>> * About marking disks as offline when exceeding a certain
> threshold,
> >>>>>>>> that actually increases the blast radius of log compaction
> failures.
> >>>>>>>> Currently, the uncleaned partitions are still readable and
> writable.
> >>>>>>>> Taking the disks offline would impact availability of the
> >>> uncleanable
> >>>>>>>> partitions, as well as impact all other partitions that are on the
> >>>>>> disk.
> >>>>>>> In general, when we encounter I/O errors, we take the disk
> partition
> >>>>>>> offline.  This is spelled out in KIP-112 (
> >>>>>>>
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> >>>>>>> ) :
> >>>>>>>
> >>>>>>>> - Broker assumes a log directory to be good after it starts, and
> >>> mark
> >>>>>>> log directory as
> >>>>>>>> bad once there is IOException when broker attempts to access (i.e.
> >>>>> read
> >>>>>>> or write) the log directory.
> >>>>>>>> - Broker will be offline if all log directories are bad.
> >>>>>>>> - Broker will stop serving replicas in any bad log directory. New
> >>>>>>> replicas will only be created
> >>>>>>>> on good log directory.
> >>>>>>> The behavior Stanislav is proposing for the log cleaner is actually
> >>>>> more
> >>>>>>> optimistic than what we do for regular broker I/O, since we will
> >>>>> tolerate
> >>>>>>> multiple IOExceptions, not just one.  But it's generally
> consistent.
> >>>>>>> Ignoring errors is not.  In any case, if you want to tolerate an
> >>>>>> unlimited
> >>>>>>> number of I/O errors, you can just set the threshold to an infinite
> >>>>> value
> >>>>>>> (although I think that would be a bad idea).
> >>>>>>>
> >>>>>>> best,
> >>>>>>> Colin
> >>>>>>>
> >>>>>>>> -James
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
> >>>>>>> stanislav@confluent.io> wrote:
> >>>>>>>>> I renamed the KIP and that changed the link. Sorry about that.
> Here
> >>>>>> is
> >>>>>>> the
> >>>>>>>>> new link:
> >>>>>>>>>
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
> >>>>>>> stanislav@confluent.io>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey group,
> >>>>>>>>>>
> >>>>>>>>>> I created a new KIP about making log compaction more
> >>>>> fault-tolerant.
> >>>>>>>>>> Please give it a look here and please share what you think,
> >>>>>>> especially in
> >>>>>>>>>> regards to the points in the "Needs Discussion" paragraph.
> >>>>>>>>>>
> >>>>>>>>>> KIP: KIP-346
> >>>>>>>>>> <
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> >>>>>>>>>> --
> >>>>>>>>>> Best,
> >>>>>>>>>> Stanislav
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best,
> >>>>>>>>> Stanislav
> >>>>
> >>>
> >>>
> >>
> >> --
> >> Best,
> >> Stanislav
> >>
> >
> >
> > --
> > Best,
> > Stanislav
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by James Cheng <wu...@gmail.com>.
Can you update the KIP to say what the default is for max.uncleanable.partitions?

-James

Sent from my iPhone

> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <st...@confluent.io> wrote:
> 
> Hey group,
> 
> I am planning on starting a voting thread tomorrow. Please do reply if you
> feel there is anything left to discuss.
> 
> Best,
> Stanislav
> 
> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <st...@confluent.io>
> wrote:
> 
>> Hey, Ray
>> 
>> Thanks for pointing that out, it's fixed now
>> 
>> Best,
>> Stanislav
>> 
>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org> wrote:
>>> 
>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
>>> the main KIP landing page
>>> <
>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>>> 
>>> I tried, but the Wiki won't let me.
>>> 
>>> -Ray
>>> 
>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>>>> Hey guys,
>>>> 
>>>> @Colin - good point. I added some sentences mentioning recent
>>> improvements
>>>> in the introductory section.
>>>> 
>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
>>> fails,
>>>> you don't want to work with it again. As such, I've changed my mind and
> >>>> believe that we should mark the LogDir (assume it's a disk) as offline on
>>>> the first `IOException` encountered. This is the LogCleaner's current
>>>> behavior. We shouldn't change that.
>>>> 
>>>> *Respawning Threads* - I believe we should never re-spawn a thread. The
>>>> correct approach in my mind is to either have it stay dead or never let
>>> it
>>>> die in the first place.
>>>> 
>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
>>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
>>> and
>>>> inspect logs.
>>>> 
>>>> 
>>>> Hey Ray,
>>>> 
>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
>>>>> skip over problematic partitions instead of dying.
>>>> I think we can do this for every exception that isn't `IOException`.
>>> This
>>>> will future-proof us against bugs in the system and potential other
>>> errors.
>>>> Protecting yourself against unexpected failures is always a good thing
>>> in
>>>> my mind, but I also think that protecting yourself against bugs in the
>>>> software is sort of clunky. What does everybody think about this?
>>>> 
>>>>> 4) The only improvement I can think of is that if such an
>>>>> error occurs, then have the option (configuration setting?) to create a
>>>>> <log_segment>.skip file (or something similar).
>>>> This is a good suggestion. Have others also seen corruption be generally
>>>> tied to the same segment?
>>>> 
>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io>
>>> wrote:
>>>> 
>>>>> For the cleaner thread specifically, I do not think respawning will
>>> help at
>>>>> all because we are more than likely to run into the same issue again
>>> which
>>>>> would end up crashing the cleaner. Retrying makes sense for transient
>>>>> errors or when you believe some part of the system could have healed
>>>>> itself, both of which I think are not true for the log cleaner.
>>>>> 
>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> <<<respawning threads is likely to make things worse, by putting you
>>> in
>>>>> an
>>>>>> infinite loop which consumes resources and fires off continuous log
>>>>>> messages.
>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
>>> effect
>>>>> is
>>>>>> to implement a backoff mechanism (if a second respawn is to occur then
>>>>> wait
>>>>>> for 1 minute before doing it; then if a third respawn is to occur wait
>>>>> for
>>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some
>>> max
>>>>>> wait time).
>>>>>> 
>>>>>> I have no opinion on whether respawn is appropriate or not in this
>>>>> context,
>>>>>> but a mitigation like the increasing backoff described above may be
>>>>>> relevant in weighing the pros and cons.
>>>>>> 
>>>>>> Ron
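[The increasing-backoff schedule Ron describes (wait 1 minute, then 2, then 4, 8, etc. up to some max) can be sketched as follows. This is an illustrative Python sketch only; the function name and the 30-minute cap are assumptions, not part of any proposal:

```python
def respawn_backoff_minutes(attempt, max_wait_minutes=30):
    """Return how long to wait before the Nth respawn (1-indexed).

    Doubles the wait on every attempt -- 1, 2, 4, 8, ... minutes --
    capped at max_wait_minutes so a persistently failing thread does
    not push the delay out indefinitely.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return min(2 ** (attempt - 1), max_wait_minutes)

# Waits before the first four respawns: 1, 2, 4 and 8 minutes.
schedule = [respawn_backoff_minutes(n) for n in range(1, 5)]
```

The cap bounds the cost of Colin's infinite-loop concern: a thread that keeps dying still respawns, but at a fixed, low rate.]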
>>>>>> 
>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
>>> wrote:
>>>>>> 
>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>>>> 
>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant
>>> of
>>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
>>>>>>>> 
>>>>>>>> Things are better now than they used to be. We have the metric
>>>>>>>>       kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>>>>>> which we can use to tell us if the threads are dead. And as of
>>> 1.1.0,
>>>>>> we
>>>>>>>> have KIP-226, which allows you to restart the log cleaner thread,
>>>>>>>> without requiring a broker restart.
>>>>>>>> 
>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>>> <
>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>> 
>>>>>>>> I've only read about this, I haven't personally tried it.
>>>>>>> Thanks for pointing this out, James!  Stanislav, we should probably
>>>>> add a
>>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the KIP.
>>>>>> Maybe
>>>>>>> in the intro section?
>>>>>>> 
>>>>>>> I think it's clear that requiring the users to manually restart the
>>> log
>>>>>>> cleaner is not a very good solution.  But it's good to know that
>>> it's a
>>>>>>> possibility on some older releases.
>>>>>>> 
>>>>>>>> Some comments:
>>>>>>>> * I like the idea of having the log cleaner continue to clean as
>>> many
>>>>>>>> partitions as it can, skipping over the problematic ones if
>>> possible.
>>>>>>>> 
>>>>>>>> * If the log cleaner thread dies, I think it should automatically be
>>>>>>>> revived. Your KIP attempts to do that by catching exceptions during
>>>>>>>> execution, but I think we should go all the way and make sure that a
>>>>>> new
>>>>>>>> one gets created, if the thread ever dies.
>>>>>>> This is inconsistent with the way the rest of Kafka works.  We don't
>>>>>>> automatically re-create other threads in the broker if they
>>> terminate.
>>>>>> In
>>>>>>> general, if there is a serious bug in the code, respawning threads is
>>>>>>> likely to make things worse, by putting you in an infinite loop which
>>>>>>> consumes resources and fires off continuous log messages.
>>>>>>> 
>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions.
>>>>> I've
>>>>>>>> seen cases where an uncleanable partition later became cleanable. I
>>>>>>>> unfortunately don't remember how that happened, but I remember being
>>>>>>>> surprised when I discovered it. It might have been something like a
>>>>>>>> follower was uncleanable but after a leader election happened, the
>>>>> log
>>>>>>>> truncated and it was then cleanable again. I'm not sure.
>>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed
>>> File
>>>>>>> System (HDFS) and it was a constant source of user problems.
>>>>>>> 
>>>>>>> What would happen is disks would just go bad over time.  The DataNode
>>>>>>> would notice this and take them offline.  But then, due to some
>>>>>>> "optimistic" code, the DataNode would periodically try to re-add them
>>>>> to
>>>>>>> the system.  Then one of two things would happen: the disk would just
>>>>>> fail
>>>>>>> immediately again, or it would appear to work and then fail after a
>>>>> short
>>>>>>> amount of time.
>>>>>>> 
>>>>>>> The way the disk failed was normally having an I/O request take a
>>>>> really
>>>>>>> long time and time out.  So a bunch of request handler threads would
>>>>>>> basically slam into a brick wall when they tried to access the bad
>>>>> disk,
>>>>>>> slowing the DataNode to a crawl.  It was even worse in the second
>>>>>> scenario,
>>>>>>> if the disk appeared to work for a while, but then failed.  Any data
>>>>> that
>>>>>>> had been written on that DataNode to that disk would be lost, and we
>>>>>> would
>>>>>>> need to re-replicate it.
>>>>>>> 
>>>>>>> Disks aren't biological systems-- they don't heal over time.  Once
>>>>>> they're
>>>>>>> bad, they stay bad.  The log cleaner needs to be robust against cases
>>>>>> where
>>>>>>> the disk really is failing, and really is returning bad data or
>>> timing
>>>>>> out.
>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style
>>>>>>>> format, such as:
>>>>>>>> 
>>>>>>  kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>>>>>>>>               value=4
>>>>>>>> 
>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very
>>>>> long.
>>>>>>>> I think the current max size is 210 characters (or maybe 240-ish?).
>>>>>>>> Having the "uncleanable-partitions" being a list could be very large
>>>>>>>> metric. Also, having the metric come out as a csv might be difficult
>>>>> to
>>>>>>>> work with for monitoring systems. If we *did* want the topic names
>>> to
>>>>>> be
>>>>>>>> accessible, what do you think of having the
>>>>>>>>       kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my example
>>>>> was
>>>>>>>> that the topic and partition can be tags in the metric. That will
>>>>> allow
>>>>>>>> monitoring systems to more easily slice and dice the metric. I'm not
>>>>>>>> sure what the attribute for that metric would be. Maybe something
>>>>> like
>>>>>>>> "uncleaned bytes" for that topic-partition? Or
>>> time-since-last-clean?
>>>>>> Or
>>>>>>>> maybe even just "Value=1".
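[A sketch of the tagged object-name form James suggests, where topic and partition become tags instead of one long comma-separated attribute value. The `LogCleanerManager` type here is James's tentative example, not a settled metric name:

```python
def mbean_name(metric_type, topic, partition):
    """Compose a JMX-style object name with topic/partition as tags,
    so monitoring systems can slice per partition instead of parsing
    one long comma-separated list attribute."""
    return "kafka.log:type=%s,topic=%s,partition=%d" % (
        metric_type, topic, partition)

name = mbean_name("LogCleanerManager", "topic1", 2)
```

One MBean per uncleanable partition also sidesteps the 210-character topic-name problem, since no single attribute has to carry every partition name at once.]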
> >>>>>>> I haven't thought about this that hard, but do we really need the
>>>>>>> uncleanable topic names to be accessible through a metric?  It seems
>>>>> like
>>>>>>> the admin should notice that uncleanable partitions are present, and
>>>>> then
>>>>>>> check the logs?
>>>>>>> 
>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
>>>>>>>> indicates that the disk is having problems. I'm not sure that is the
>>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner problems,
>>>>>> all
>>>>>>>> of them are partition-level scenarios that happened during normal
>>>>>>>> operation. None of them were indicative of disk problems.
>>>>>>> I don't think this is a meaningful comparison.  In general, we don't
>>>>>>> accept JIRAs for hard disk problems that happen on a particular
>>>>> cluster.
>>>>>>> If someone opened a JIRA that said "my hard disk is having problems"
>>> we
>>>>>>> could close that as "not a Kafka bug."  This doesn't prove that disk
>>>>>>> problems don't happen, but  just that JIRA isn't the right place for
>>>>>> them.
>>>>>>> I do agree that the log cleaner has had a significant number of logic
>>>>>>> bugs, and that we need to be careful to limit their impact.  That's
>>> one
>>>>>>> reason why I think that a threshold of "number of uncleanable logs"
>>> is
>>>>> a
>>>>>>> good idea, rather than just failing after one IOException.  In all
>>> the
>>>>>>> cases I've seen where a user hit a logic bug in the log cleaner, it
>>> was
>>>>>>> just one partition that had the issue.  We also should increase test
>>>>>>> coverage for the log cleaner.
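[The thresholding Colin argues for (tolerate several uncleanable partitions per log directory before giving up on it, rather than failing on the first error) could look roughly like this. Class and field names are illustrative only, not the actual broker code:

```python
class UncleanablePartitions:
    """Track partitions the cleaner failed on, per log directory.

    A log directory is considered failed only once the number of
    distinct uncleanable partitions in it exceeds the configured
    threshold, rather than on the first error.
    """
    def __init__(self, max_uncleanable_partitions):
        self.max_uncleanable = max_uncleanable_partitions
        self.by_log_dir = {}  # log_dir -> set of (topic, partition)

    def mark(self, log_dir, topic_partition):
        """Record a failure; return True if the dir should go offline."""
        parts = self.by_log_dir.setdefault(log_dir, set())
        parts.add(topic_partition)
        return len(parts) > self.max_uncleanable

tracker = UncleanablePartitions(max_uncleanable_partitions=2)
tracker.mark("/data/kafka-0", ("topic1", 0))  # below threshold
tracker.mark("/data/kafka-0", ("topic1", 1))  # still below threshold
crossed = tracker.mark("/data/kafka-0", ("topic2", 0))  # third failure
```

Setting the threshold to a very large value approximates "never offline the dir for cleaner errors", which is the unlimited-tolerance case Colin mentions.]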
>>>>>>> 
>>>>>>>> * About marking disks as offline when exceeding a certain threshold,
>>>>>>>> that actually increases the blast radius of log compaction failures.
>>>>>>>> Currently, the uncleaned partitions are still readable and writable.
>>>>>>>> Taking the disks offline would impact availability of the
>>> uncleanable
>>>>>>>> partitions, as well as impact all other partitions that are on the
>>>>>> disk.
>>>>>>> In general, when we encounter I/O errors, we take the disk partition
>>>>>>> offline.  This is spelled out in KIP-112 (
>>>>>>> 
>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>>>>>>> ) :
>>>>>>> 
>>>>>>>> - Broker assumes a log directory to be good after it starts, and
>>> mark
>>>>>>> log directory as
>>>>>>>> bad once there is IOException when broker attempts to access (i.e.
>>>>> read
>>>>>>> or write) the log directory.
>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>> - Broker will stop serving replicas in any bad log directory. New
>>>>>>> replicas will only be created
>>>>>>>> on good log directory.
>>>>>>> The behavior Stanislav is proposing for the log cleaner is actually
>>>>> more
>>>>>>> optimistic than what we do for regular broker I/O, since we will
>>>>> tolerate
>>>>>>> multiple IOExceptions, not just one.  But it's generally consistent.
>>>>>>> Ignoring errors is not.  In any case, if you want to tolerate an
>>>>>> unlimited
>>>>>>> number of I/O errors, you can just set the threshold to an infinite
>>>>> value
>>>>>>> (although I think that would be a bad idea).
>>>>>>> 
>>>>>>> best,
>>>>>>> Colin
>>>>>>> 
>>>>>>>> -James
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
>>>>>>> stanislav@confluent.io> wrote:
>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that. Here
>>>>>> is
>>>>>>> the
>>>>>>>>> new link:
>>>>>>>>> 
>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
>>>>>>> stanislav@confluent.io>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hey group,
>>>>>>>>>> 
>>>>>>>>>> I created a new KIP about making log compaction more
>>>>> fault-tolerant.
>>>>>>>>>> Please give it a look here and please share what you think,
>>>>>>> especially in
>>>>>>>>>> regards to the points in the "Needs Discussion" paragraph.
>>>>>>>>>> 
>>>>>>>>>> KIP: KIP-346
>>>>>>>>>> <
>>>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>>>>>>>>>> --
>>>>>>>>>> Best,
>>>>>>>>>> Stanislav
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Best,
>>>>>>>>> Stanislav
>>>> 
>>> 
>>> 
>> 
>> --
>> Best,
>> Stanislav
>> 
> 
> 
> -- 
> Best,
> Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey group,

I just wanted to note that I have an implementation ready for review. Feel
free to take a quick look and raise any concerns you might have in due
time. I plan on starting the voting thread tomorrow.

Best,
Stanislav

On Wed, Aug 1, 2018 at 10:01 AM Stanislav Kozlovski <st...@confluent.io>
wrote:

> Hey Ray,
>
> * Yes, we'd have the logDir as a tag in the metric
> * The intention is to have Int.MaxValue as the maximum uncleanable
> partitions count
> * My idea is to store the marked logs (actually partitions) in memory
> instead of the ".skip" files to keep the change simple. I have also decided
> to omit any retries from the implementation - once a partition is marked as
> "uncleanable" it stays so until a broker restart
>
> Please do let me know if you are okay with this description. I should have
> the code available for review soon
>
> Thanks,
> Stanislav
>
> On Tue, Jul 31, 2018 at 6:30 PM Ray Chiang <rc...@apache.org> wrote:
>
>> I had one question that I was trying to do some investigation before I
>> asked, but I'm having some issues with my JMX browser right now.
>>
>>   * For the uncleanable-partitions-count metric, is that going to be
>>     per-logDir entry?
>>   * For max.uncleanable.partitions, is the intention to have -1 be
>>     "infinite" or are we going to use Int.MaxValue as a practical
>>     equivalent?
>>   * In this sentence: "When evaluating which logs to compact, skip the
>>     marked ones.", should we define what "marking" will be?  If we're
>>     going with the ".skip" file or equivalent, can we also add how
>>     successful retries will behave?
>>
>> -Ray
>>
> --
> Best,
> Stanislav
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey Ray,

* Yes, we'd have the logDir as a tag in the metric
* The intention is to have Int.MaxValue as the maximum uncleanable
partitions count
* My idea is to store the marked logs (actually partitions) in memory
instead of the ".skip" files to keep the change simple. I have also decided
to omit any retries from the implementation - once a partition is marked as
"uncleanable" it stays so until a broker restart

Please do let me know if you are okay with this description. I should have
the code available for review soon

Thanks,
Stanislav
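[A rough sketch of the in-memory approach described above: marked partitions live in a plain set, so a broker restart clears them implicitly (matching the no-retry behavior), and the cleaner filters them out when picking logs to compact. All names here are illustrative, not the actual LogCleaner code:

```python
class UncleanableTracker:
    """Minimal sketch: uncleanable partitions are held only in memory,
    so a broker restart implicitly resets them; there are no retries
    before that."""
    def __init__(self):
        self.uncleanable = set()  # (topic, partition) pairs

    def mark_uncleanable(self, topic_partition):
        self.uncleanable.add(topic_partition)

    def cleanable_candidates(self, all_partitions):
        # When evaluating which logs to compact, skip the marked ones.
        return [tp for tp in all_partitions if tp not in self.uncleanable]

mgr = UncleanableTracker()
mgr.mark_uncleanable(("topic1", 0))
candidates = mgr.cleanable_candidates([("topic1", 0), ("topic1", 1)])
```

Compared with on-disk ".skip" files, the in-memory set keeps the change small at the cost of forgetting the markings across restarts.]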

On Tue, Jul 31, 2018 at 6:30 PM Ray Chiang <rc...@apache.org> wrote:

> I had one question that I was trying to do some investigation before I
> asked, but I'm having some issues with my JMX browser right now.
>
>   * For the uncleanable-partitions-count metric, is that going to be
>     per-logDir entry?
>   * For max.uncleanable.partitions, is the intention to have -1 be
>     "infinite" or are we going to use Int.MaxValue as a practical
>     equivalent?
>   * In this sentence: "When evaluating which logs to compact, skip the
>     marked ones.", should we define what "marking" will be?  If we're
>     going with the ".skip" file or equivalent, can we also add how
>     successful retries will behave?
>
> -Ray
>
> On 7/31/18 9:56 AM, Stanislav Kozlovski wrote:
> > Hey group,
> >
> > I am planning on starting a voting thread tomorrow. Please do reply if
> you
> > feel there is anything left to discuss.
> >
> > Best,
> > Stanislav
> >
> > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> stanislav@confluent.io>
> > wrote:
> >
> >> Hey, Ray
> >>
> >> Thanks for pointing that out, it's fixed now
> >>
> >> Best,
> >> Stanislav
> >>
> >> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org> wrote:
> >>
> >>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
> >>> the main KIP landing page
> >>> <
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
> >?
> >>>
> >>> I tried, but the Wiki won't let me.
> >>>
> >>> -Ray
> >>>
> >>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>> Hey guys,
> >>>>
> >>>> @Colin - good point. I added some sentences mentioning recent
> >>> improvements
> >>>> in the introductory section.
> >>>>
> >>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
> >>> fails,
> >>>> you don't want to work with it again. As such, I've changed my mind
> and
> >>>> believe that we should mark the LogDir (assuming it's a disk) as offline
> on
> >>>> the first `IOException` encountered. This is the LogCleaner's current
> >>>> behavior. We shouldn't change that.
> >>>>
> >>>> *Respawning Threads* - I believe we should never re-spawn a thread.
> The
> >>>> correct approach in my mind is to either have it stay dead or never
> let
> >>> it
> >>>> die in the first place.
> >>>>
> >>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
> >>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
> >>> and
> >>>> inspect logs.
> >>>>
> >>>>
> >>>> Hey Ray,
> >>>>
> >>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
> >>>>> skip over problematic partitions instead of dying.
> >>>> I think we can do this for every exception that isn't `IOException`.
> >>> This
> >>>> will future-proof us against bugs in the system and potential other
> >>> errors.
> >>>> Protecting yourself against unexpected failures is always a good thing
> >>> in
> >>>> my mind, but I also think that protecting yourself against bugs in the
> >>>> software is sort of clunky. What does everybody think about this?
> >>>>
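The skip-on-error behavior discussed above can be sketched roughly as follows. This is an illustrative standalone sketch, not Kafka's actual `LogCleaner` internals: the class and method names (`CleanerLoop`, `cleanPartition`, the `-corrupt` failure trigger) are invented for the example. The key point is the distinction between `IOException` (propagated, so the log dir can be taken offline) and everything else (partition marked uncleanable, cleaning continues).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CleanerLoop {
    private final Set<String> uncleanable = new LinkedHashSet<>();

    public void cleanAll(List<String> partitions) {
        for (String tp : partitions) {
            if (uncleanable.contains(tp)) {
                continue; // "when evaluating which logs to compact, skip the marked ones"
            }
            try {
                cleanPartition(tp);
            } catch (IOException e) {
                // Disk-level failure: propagate so the whole log dir can be marked offline.
                throw new UncheckedIOException(e);
            } catch (RuntimeException e) {
                // Logic bug confined to this partition: mark it, keep cleaning the rest.
                uncleanable.add(tp);
            }
        }
    }

    // Stand-in for the real compaction work; here only a name pattern fails.
    public void cleanPartition(String tp) throws IOException {
        if (tp.endsWith("-corrupt")) {
            throw new IllegalStateException("offset map corruption in " + tp);
        }
    }

    public Set<String> uncleanablePartitions() { return uncleanable; }
}
```

With this shape, one bad partition no longer kills compaction for the whole broker; subsequent cleaning rounds simply skip it.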
> >>>>> 4) The only improvement I can think of is that if such an
> >>>>> error occurs, then have the option (configuration setting?) to
> create a
> >>>>> <log_segment>.skip file (or something similar).
> >>>> This is a good suggestion. Have others also seen corruption be
> generally
> >>>> tied to the same segment?
> >>>>
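Ray's `<log_segment>.skip` marker idea could work along these lines. The file layout and naming below are illustrative assumptions, not an agreed design: an empty sibling file marks a segment as skippable, and a successful retry would just delete the marker.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SkipMarker {
    // e.g. 00000000000000000000.log -> 00000000000000000000.log.skip
    public static Path markerFor(Path segment) {
        return segment.resolveSibling(segment.getFileName() + ".skip");
    }

    public static void mark(Path segment) {
        try {
            Path m = markerFor(segment);
            if (!Files.exists(m)) Files.createFile(m);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static boolean shouldSkip(Path segment) {
        return Files.exists(markerFor(segment));
    }

    /** Demo: create a temp segment, mark it, report whether it is then skipped. */
    public static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("logseg");
            Path seg = dir.resolve("00000000000000000000.log");
            Files.createFile(seg);
            boolean before = shouldSkip(seg);
            mark(seg);
            return !before && shouldSkip(seg);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A marker file has the nice property of surviving broker restarts, which an in-memory uncleanable set does not.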
> >>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io>
> >>> wrote:
> >>>>> For the cleaner thread specifically, I do not think respawning will
> >>> help at
> >>>>> all because we are more than likely to run into the same issue again
> >>> which
> >>>>> would end up crashing the cleaner. Retrying makes sense for transient
> >>>>> errors or when you believe some part of the system could have healed
> >>>>> itself, both of which I think are not true for the log cleaner.
> >>>>>
> >>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
> >>> wrote:
> >>>>>> <<<respawning threads is likely to make things worse, by putting you
> >>> in
> >>>>> an
> >>>>>> infinite loop which consumes resources and fires off continuous log
> >>>>>> messages.
> >>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this
> >>> effect
> >>>>> is
> >>>>>> to implement a backoff mechanism (if a second respawn is to occur
> then
> >>>>> wait
> >>>>>> for 1 minute before doing it; then if a third respawn is to occur
> wait
> >>>>> for
> >>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
> some
> >>> max
> >>>>>> wait time).
> >>>>>>
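The capped exponential backoff Ron describes could be computed like this. The one-minute base and thirty-minute cap are illustrative values, not proposed defaults:

```java
public class RespawnBackoff {
    static final long BASE_MS = 60_000L;        // 1 minute before the second respawn
    static final long MAX_MS  = 30 * 60_000L;   // cap the wait at 30 minutes

    /** Wait before respawn attempt n (n = 1 is the first respawn, immediate). */
    public static long backoffMs(int attempt) {
        if (attempt <= 1) return 0L;
        int exp = Math.min(attempt - 2, 20);    // clamp so the shift cannot overflow
        return Math.min(BASE_MS << exp, MAX_MS); // 1, 2, 4, 8 ... minutes, capped
    }
}
```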
> >>>>>> I have no opinion on whether respawn is appropriate or not in this
> >>>>> context,
> >>>>>> but a mitigation like the increasing backoff described above may be
> >>>>>> relevant in weighing the pros and cons.
> >>>>>>
> >>>>>> Ron
> >>>>>>
> >>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
> >>> wrote:
> >>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>
> >>>>>>>> I agree that it would be good if the LogCleaner were more tolerant
> >>> of
> >>>>>>>> errors. Currently, as you said, once it dies, it stays dead.
> >>>>>>>>
> >>>>>>>> Things are better now than they used to be. We have the metric
> >>>>>>>>
>  kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>> which we can use to tell us if the threads are dead. And as of
> >>> 1.1.0,
> >>>>>> we
> >>>>>>>> have KIP-226, which allows you to restart the log cleaner thread,
> >>>>>>>> without requiring a broker restart.
> >>>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>> <
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>> I've only read about this, I haven't personally tried it.
> >>>>>>> Thanks for pointing this out, James!  Stanislav, we should probably
> >>>>> add a
> >>>>>>> sentence or two mentioning the KIP-226 changes somewhere in the
> KIP.
> >>>>>> Maybe
> >>>>>>> in the intro section?
> >>>>>>>
> >>>>>>> I think it's clear that requiring the users to manually restart the
> >>> log
> >>>>>>> cleaner is not a very good solution.  But it's good to know that
> >>> it's a
> >>>>>>> possibility on some older releases.
> >>>>>>>
> >>>>>>>> Some comments:
> >>>>>>>> * I like the idea of having the log cleaner continue to clean as
> >>> many
> >>>>>>>> partitions as it can, skipping over the problematic ones if
> >>> possible.
> >>>>>>>> * If the log cleaner thread dies, I think it should automatically
> be
> >>>>>>>> revived. Your KIP attempts to do that by catching exceptions
> during
> >>>>>>>> execution, but I think we should go all the way and make sure
> that a
> >>>>>> new
> >>>>>>>> one gets created, if the thread ever dies.
> >>>>>>> This is inconsistent with the way the rest of Kafka works.  We
> don't
> >>>>>>> automatically re-create other threads in the broker if they
> >>> terminate.
> >>>>>> In
> >>>>>>> general, if there is a serious bug in the code, respawning threads
> is
> >>>>>>> likely to make things worse, by putting you in an infinite loop
> which
> >>>>>>> consumes resources and fires off continuous log messages.
> >>>>>>>
> >>>>>>>> * It might be worth trying to re-clean the uncleanable partitions.
> >>>>> I've
> >>>>>>>> seen cases where an uncleanable partition later became cleanable.
> I
> >>>>>>>> unfortunately don't remember how that happened, but I remember
> being
> >>>>>>>> surprised when I discovered it. It might have been something like
> a
> >>>>>>>> follower was uncleanable but after a leader election happened, the
> >>>>> log
> >>>>>>>> truncated and it was then cleanable again. I'm not sure.
> >>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed
> >>> File
> >>>>>>> System (HDFS) and it was a constant source of user problems.
> >>>>>>>
> >>>>>>> What would happen is disks would just go bad over time.  The
> DataNode
> >>>>>>> would notice this and take them offline.  But then, due to some
> >>>>>>> "optimistic" code, the DataNode would periodically try to re-add
> them
> >>>>> to
> >>>>>>> the system.  Then one of two things would happen: the disk would
> just
> >>>>>> fail
> >>>>>>> immediately again, or it would appear to work and then fail after a
> >>>>> short
> >>>>>>> amount of time.
> >>>>>>>
> >>>>>>> The way the disk failed was normally having an I/O request take a
> >>>>> really
> >>>>>>> long time and time out.  So a bunch of request handler threads
> would
> >>>>>>> basically slam into a brick wall when they tried to access the bad
> >>>>> disk,
> >>>>>>> slowing the DataNode to a crawl.  It was even worse in the second
> >>>>>> scenario,
> >>>>>>> if the disk appeared to work for a while, but then failed.  Any
> data
> >>>>> that
> >>>>>>> had been written on that DataNode to that disk would be lost, and
> we
> >>>>>> would
> >>>>>>> need to re-replicate it.
> >>>>>>>
> >>>>>>> Disks aren't biological systems-- they don't heal over time.  Once
> >>>>>> they're
> >>>>>>> bad, they stay bad.  The log cleaner needs to be robust against
> cases
> >>>>>> where
> >>>>>>> the disk really is failing, and really is returning bad data or
> >>> timing
> >>>>>> out.
> >>>>>>>> * For your metrics, can you spell out the full metric in JMX-style
> >>>>>>>> format, such as:
> >>>>>>>>
> >>>>>>
> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >>>>>>>>                 value=4
> >>>>>>>>
> >>>>>>>> * For "uncleanable-partitions": topic-partition names can be very
> >>>>> long.
> >>>>>>>> I think the current max size is 210 characters (or maybe
> 240-ish?).
> >>>>>>>> Having the "uncleanable-partitions" being a list could be very
> large
> >>>>>>>> metric. Also, having the metric come out as a csv might be
> difficult
> >>>>> to
> >>>>>>>> work with for monitoring systems. If we *did* want the topic names
> >>> to
> >>>>>> be
> >>>>>>>> accessible, what do you think of having the
> >>>>>>>>         kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>> I'm not sure if LogCleanerManager is the right type, but my
> example
> >>>>> was
> >>>>>>>> that the topic and partition can be tags in the metric. That will
> >>>>> allow
> >>>>>>>> monitoring systems to more easily slice and dice the metric. I'm
> not
> >>>>>>>> sure what the attribute for that metric would be. Maybe something
> >>>>> like
> >>>>>>>> "uncleaned bytes" for that topic-partition? Or
> >>> time-since-last-clean?
> >>>>>> Or
> >>>>>>>> maybe even just "Value=1".
> >>>>>>> I haven't thought about this that hard, but do we really need the
> >>>>>>> uncleanable topic names to be accessible through a metric?  It
> seems
> >>>>> like
> >>>>>>> the admin should notice that uncleanable partitions are present,
> and
> >>>>> then
> >>>>>>> check the logs?
> >>>>>>>
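The two metric shapes discussed above can be sketched with the JDK's `javax.management` API alone. Kafka actually registers its metrics through `com.yammer.metrics`, so this is only illustrative of the `ObjectName` structure, not of Kafka's code: a single counter under `name=uncleanable-partitions-count`, and James's tag-based alternative with `topic`/`partition` as key properties.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class UncleanableMetrics {
    // The MXBean suffix makes the platform treat this as a compliant MXBean.
    public interface CountMXBean { int getValue(); }

    public static class Count implements CountMXBean {
        private final int value;
        public Count(int value) { this.value = value; }
        public int getValue() { return value; }
    }

    /** Register the counter and read it back the way a monitoring agent would. */
    public static int registerAndRead(int count) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName(
                "kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count");
            server.registerMBean(new Count(count), name);
            return (Integer) server.getAttribute(name, "Value");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Tag-based name: topic and partition become queryable key properties. */
    public static ObjectName taggedName(String topic, int partition) {
        try {
            return new ObjectName(String.format(
                "kafka.log:type=LogCleanerManager,name=uncleanable-partition,topic=%s,partition=%d",
                topic, partition));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** True if a name would match a wildcard query for uncleanable partitions. */
    public static boolean matchesUncleanableQuery(ObjectName name) {
        try {
            return new ObjectName(
                "kafka.log:type=LogCleanerManager,name=uncleanable-partition,*").apply(name);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

One MBean per uncleanable partition also sidesteps the length worry: there is no single CSV attribute to truncate, and monitoring systems can enumerate the partitions with the wildcard pattern.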
> >>>>>>>> * About `max.uncleanable.partitions`, you said that this likely
> >>>>>>>> indicates that the disk is having problems. I'm not sure that is
> the
> >>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> problems,
> >>>>>> all
> >>>>>>>> of them are partition-level scenarios that happened during normal
> >>>>>>>> operation. None of them were indicative of disk problems.
> >>>>>>> I don't think this is a meaningful comparison.  In general, we
> don't
> >>>>>>> accept JIRAs for hard disk problems that happen on a particular
> >>>>> cluster.
> >>>>>>> If someone opened a JIRA that said "my hard disk is having
> problems"
> >>> we
> >>>>>>> could close that as "not a Kafka bug."  This doesn't prove that
> disk
> >>>>>>> problems don't happen, but just that JIRA isn't the right place for
> for
> >>>>>> them.
> >>>>>>> I do agree that the log cleaner has had a significant number of
> logic
> >>>>>>> bugs, and that we need to be careful to limit their impact.  That's
> >>> one
> >>>>>>> reason why I think that a threshold of "number of uncleanable logs"
> >>> is
> >>>>> a
> >>>>>>> good idea, rather than just failing after one IOException.  In all
> >>> the
> >>>>>>> cases I've seen where a user hit a logic bug in the log cleaner, it
> >>> was
> >>>>>>> just one partition that had the issue.  We also should increase
> test
> >>>>>>> coverage for the log cleaner.
> >>>>>>>
> >>>>>>>> * About marking disks as offline when exceeding a certain
> threshold,
> >>>>>>>> that actually increases the blast radius of log compaction
> failures.
> >>>>>>>> Currently, the uncleaned partitions are still readable and
> writable.
> >>>>>>>> Taking the disks offline would impact availability of the
> >>> uncleanable
> >>>>>>>> partitions, as well as impact all other partitions that are on the
> >>>>>> disk.
> >>>>>>> In general, when we encounter I/O errors, we take the disk
> partition
> >>>>>>> offline.  This is spelled out in KIP-112 (
> >>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> >>>>>>> ) :
> >>>>>>>
> >>>>>>>> - Broker assumes a log directory to be good after it starts, and
> >>> mark
> >>>>>>> log directory as
> >>>>>>>> bad once there is IOException when broker attempts to access (i.e.
> >>>>> read
> >>>>>>> or write) the log directory.
> >>>>>>>> - Broker will be offline if all log directories are bad.
> >>>>>>>> - Broker will stop serving replicas in any bad log directory. New
> >>>>>>> replicas will only be created
> >>>>>>>> on good log directory.
> >>>>>>> The behavior Stanislav is proposing for the log cleaner is actually
> >>>>> more
> >>>>>>> optimistic than what we do for regular broker I/O, since we will
> >>>>> tolerate
> >>>>>>> multiple IOExceptions, not just one.  But it's generally
> consistent.
> >>>>>>> Ignoring errors is not.  In any case, if you want to tolerate an
> >>>>>> unlimited
> >>>>>>> number of I/O errors, you can just set the threshold to an infinite
> >>>>> value
> >>>>>>> (although I think that would be a bad idea).
> >>>>>>>
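The threshold behavior Colin describes could track state roughly like this. It is a sketch under stated assumptions: per-logDir bookkeeping, and `-1` treated as "no limit" (whether the KIP settles on `-1` or `Int.MaxValue` is one of Ray's open questions above). The class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class UncleanableTracker {
    private final int maxUncleanablePartitions; // -1 means unlimited
    private final Map<String, Set<String>> uncleanableByDir = new HashMap<>();
    private final Set<String> offlineDirs = new HashSet<>();

    public UncleanableTracker(int maxUncleanablePartitions) {
        this.maxUncleanablePartitions = maxUncleanablePartitions;
    }

    /** Record a cleaning failure; returns true if the log dir is now offline. */
    public boolean markUncleanable(String logDir, String topicPartition) {
        Set<String> parts =
            uncleanableByDir.computeIfAbsent(logDir, d -> new HashSet<>());
        parts.add(topicPartition);
        if (maxUncleanablePartitions >= 0 && parts.size() > maxUncleanablePartitions) {
            offlineDirs.add(logDir); // threshold exceeded: take the whole dir offline
        }
        return offlineDirs.contains(logDir);
    }

    public int uncleanableCount(String logDir) {
        return uncleanableByDir.getOrDefault(logDir, Set.of()).size();
    }

    public boolean isOffline(String logDir) { return offlineDirs.contains(logDir); }
}
```

This keeps the behavior consistent with KIP-112's handling of bad log directories while tolerating a bounded number of per-partition cleaner failures first.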
> >>>>>>> best,
> >>>>>>> Colin
> >>>>>>>
> >>>>>>>> -James
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
> >>>>>>> stanislav@confluent.io> wrote:
> >>>>>>>>> I renamed the KIP and that changed the link. Sorry about that.
> Here
> >>>>>> is
> >>>>>>> the
> >>>>>>>>> new link:
> >>>>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
> >>>>>>> stanislav@confluent.io>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey group,
> >>>>>>>>>>
> >>>>>>>>>> I created a new KIP about making log compaction more
> >>>>> fault-tolerant.
> >>>>>>>>>> Please give it a look here and please share what you think,
> >>>>>>> especially in
> >>>>>>>>>> regards to the points in the "Needs Discussion" paragraph.
> >>>>>>>>>>
> >>>>>>>>>> KIP: KIP-346
> >>>>>>>>>> <
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> >>>>>>>>>> --
> >>>>>>>>>> Best,
> >>>>>>>>>> Stanislav
> >>>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best,
> >>>>>>>>> Stanislav
> >>>
> >> --
> >> Best,
> >> Stanislav
> >>
> >
>
>

-- 
Best,
Stanislav

>> >> I've
>> >>>>> seen cases where an uncleanable partition later became cleanable. I
>> >>>>> unfortunately don't remember how that happened, but I remember being
>> >>>>> surprised when I discovered it. It might have been something like a
>> >>>>> follower was uncleanable but after a leader election happened, the
>> >> log
>> >>>>> truncated and it was then cleanable again. I'm not sure.
>> >>>> James, I disagree.  We had this behavior in the Hadoop Distributed
>> File
>> >>>> System (HDFS) and it was a constant source of user problems.
>> >>>>
>> >>>> What would happen is disks would just go bad over time.  The DataNode
>> >>>> would notice this and take them offline.  But then, due to some
>> >>>> "optimistic" code, the DataNode would periodically try to re-add them
>> >> to
>> >>>> the system.  Then one of two things would happen: the disk would just
>> >>> fail
>> >>>> immediately again, or it would appear to work and then fail after a
>> >> short
>> >>>> amount of time.
>> >>>>
>> >>>> The way the disk failed was normally having an I/O request take a
>> >> really
>> >>>> long time and time out.  So a bunch of request handler threads would
>> >>>> basically slam into a brick wall when they tried to access the bad
>> >> disk,
>> >>>> slowing the DataNode to a crawl.  It was even worse in the second
>> >>> scenario,
>> >>>> if the disk appeared to work for a while, but then failed.  Any data
>> >> that
>> >>>> had been written on that DataNode to that disk would be lost, and we
>> >>> would
>> >>>> need to re-replicate it.
>> >>>>
>> >>>> Disks aren't biological systems-- they don't heal over time.  Once
>> >>> they're
>> >>>> bad, they stay bad.  The log cleaner needs to be robust against cases
>> >>> where
>> >>>> the disk really is failing, and really is returning bad data or
>> timing
>> >>> out.
>> >>>>> * For your metrics, can you spell out the full metric in JMX-style
>> >>>>> format, such as:
>> >>>>>
>> >>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>> >>>>>                value=4
>> >>>>>
>> >>>>> * For "uncleanable-partitions": topic-partition names can be very
>> >> long.
>> >>>>> I think the current max size is 210 characters (or maybe 240-ish?).
>> >>>>> Having the "uncleanable-partitions" being a list could be very large
>> >>>>> metric. Also, having the metric come out as a csv might be difficult
>> >> to
>> >>>>> work with for monitoring systems. If we *did* want the topic names
>> to
>> >>> be
>> >>>>> accessible, what do you think of having the
>> >>>>>        kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>> >>>>> I'm not sure if LogCleanerManager is the right type, but my example
>> >> was
>> >>>>> that the topic and partition can be tags in the metric. That will
>> >> allow
>> >>>>> monitoring systems to more easily slice and dice the metric. I'm not
>> >>>>> sure what the attribute for that metric would be. Maybe something
>> >> like
>> >>>>> "uncleaned bytes" for that topic-partition? Or
>> time-since-last-clean?
>> >>> Or
>> >>>>> maybe even just "Value=1".
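James's tagged-metric suggestion maps directly onto JMX key properties: topic and partition become properties of the ObjectName rather than parts of a single value. A minimal sketch (the metric type and attribute are assumptions for illustration, not a committed naming scheme):

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import java.util.Hashtable;

public class UncleanablePartitionMetric {
    /** Builds a per-partition MBean name with topic/partition as tags, e.g.
     *  kafka.log:type=LogCleanerManager,topic=topic1,partition=2
     *  so monitoring systems can slice and dice by either tag. */
    static ObjectName mbeanName(String topic, int partition) {
        Hashtable<String, String> tags = new Hashtable<>();
        tags.put("type", "LogCleanerManager");
        tags.put("topic", topic);
        tags.put("partition", Integer.toString(partition));
        try {
            return new ObjectName("kafka.log", tags);
        } catch (MalformedObjectNameException e) {
            throw new IllegalStateException(e);  // only possible with illegal characters
        }
    }
}
```

Note that the canonical form of such a name sorts the key properties alphabetically, so dashboards matching on the canonical string see `partition` before `topic` before `type`.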
>> >>>> I haven't thought about this that hard, but do we really need the

>> >>>> uncleanable topic names to be accessible through a metric?  It seems
>> >> like
>> >>>> the admin should notice that uncleanable partitions are present, and
>> >> then
>> >>>> check the logs?
>> >>>>
>> >>>>> * About `max.uncleanable.partitions`, you said that this likely
>> >>>>> indicates that the disk is having problems. I'm not sure that is the
>> >>>>> case. For the 4 JIRAs that you mentioned about log cleaner problems,
>> >>> all
>> >>>>> of them are partition-level scenarios that happened during normal
>> >>>>> operation. None of them were indicative of disk problems.
>> >>>> I don't think this is a meaningful comparison.  In general, we don't
>> >>>> accept JIRAs for hard disk problems that happen on a particular
>> >> cluster.
>> >>>> If someone opened a JIRA that said "my hard disk is having problems"
>> we
>> >>>> could close that as "not a Kafka bug."  This doesn't prove that disk
>> >>>> problems don't happen, but just that JIRA isn't the right place for
>> >>> them.
>> >>>> I do agree that the log cleaner has had a significant number of logic
>> >>>> bugs, and that we need to be careful to limit their impact.  That's
>> one
>> >>>> reason why I think that a threshold of "number of uncleanable logs"
>> is
>> >> a
>> >>>> good idea, rather than just failing after one IOException.  In all
>> the
>> >>>> cases I've seen where a user hit a logic bug in the log cleaner, it
>> was
>> >>>> just one partition that had the issue.  We also should increase test
>> >>>> coverage for the log cleaner.
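The "number of uncleanable logs" threshold Colin argues for can be sketched as a small per-log-directory tracker. The config name `max.uncleanable.partitions` is taken from the KIP discussion; the class and method names here are illustrative, not broker code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class UncleanableTracker {
    private final int maxUncleanablePartitions;  // stands in for max.uncleanable.partitions
    private final Map<String, Set<String>> uncleanableByLogDir = new HashMap<>();

    public UncleanableTracker(int maxUncleanablePartitions) {
        this.maxUncleanablePartitions = maxUncleanablePartitions;
    }

    /** Records a cleaning failure for a partition. Returns true once the log
     *  directory has accumulated more uncleanable partitions than the threshold,
     *  i.e. the point at which the whole directory would be marked offline. */
    public boolean recordFailure(String logDir, String topicPartition) {
        Set<String> failed = uncleanableByLogDir.computeIfAbsent(logDir, d -> new HashSet<>());
        failed.add(topicPartition);  // a Set, so repeat failures on one partition count once
        return failed.size() > maxUncleanablePartitions;
    }

    /** Backs the proposed uncleanable-partitions-count metric for a log dir. */
    public int uncleanableCount(String logDir) {
        return uncleanableByLogDir.getOrDefault(logDir, Set.of()).size();
    }
}
```

Using a set rather than a counter matches the semantics discussed here: one buggy partition failing repeatedly should not, by itself, push a healthy disk over the threshold.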
>> >>>>
>> >>>>> * About marking disks as offline when exceeding a certain threshold,
>> >>>>> that actually increases the blast radius of log compaction failures.
>> >>>>> Currently, the uncleaned partitions are still readable and writable.
>> >>>>> Taking the disks offline would impact availability of the
>> uncleanable
>> >>>>> partitions, as well as impact all other partitions that are on the
>> >>> disk.
>> >>>> In general, when we encounter I/O errors, we take the disk partition
>> >>>> offline.  This is spelled out in KIP-112 (
>> >>>>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>> >>>> ) :
>> >>>>
>> >>>>> - Broker assumes a log directory to be good after it starts, and
>> mark
>> >>>> log directory as
>> >>>>> bad once there is IOException when broker attempts to access (i.e.
>> >> read
>> >>>> or write) the log directory.
>> >>>>> - Broker will be offline if all log directories are bad.
>> >>>>> - Broker will stop serving replicas in any bad log directory. New
>> >>>> replicas will only be created
>> >>>>> on good log directory.
>> >>>> The behavior Stanislav is proposing for the log cleaner is actually
>> >> more
>> >>>> optimistic than what we do for regular broker I/O, since we will
>> >> tolerate
>> >>>> multiple IOExceptions, not just one.  But it's generally consistent.
>> >>>> Ignoring errors is not.  In any case, if you want to tolerate an
>> >>> unlimited
>> >>>> number of I/O errors, you can just set the threshold to an infinite
>> >> value
>> >>>> (although I think that would be a bad idea).
>> >>>>
>> >>>> best,
>> >>>> Colin
>> >>>>
>> >>>>> -James
>> >>>>>
>> >>>>>
>> >>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
>> >>>> stanislav@confluent.io> wrote:
>> >>>>>> I renamed the KIP and that changed the link. Sorry about that. Here
>> >>> is
>> >>>> the
>> >>>>>> new link:
>> >>>>>>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>> >>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
>> >>>> stanislav@confluent.io>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> Hey group,
>> >>>>>>>
>> >>>>>>> I created a new KIP about making log compaction more
>> >> fault-tolerant.
>> >>>>>>> Please give it a look here and please share what you think,
>> >>>> especially in
>> >>>>>>> regards to the points in the "Needs Discussion" paragraph.
>> >>>>>>>
>> >>>>>>> KIP: KIP-346
>> >>>>>>> <
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>> >>>>>>> --
>> >>>>>>> Best,
>> >>>>>>> Stanislav
>> >>>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Best,
>> >>>>>> Stanislav
>> >
>>
>>
>
> --
> Best,
> Stanislav
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey, Ray

Thanks for pointing that out, it's fixed now

Best,
Stanislav

On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rc...@apache.org> wrote:

> Thanks.  Can you fix the link in the "KIPs under discussion" table on
> the main KIP landing page
> <
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
>
> I tried, but the Wiki won't let me.
>
> -Ray
>
> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > Hey guys,
> >
> > @Colin - good point. I added some sentences mentioning recent
> improvements
> > in the introductory section.
> >
> > *Disk Failure* - I tend to agree with what Colin said - once a disk
> fails,
> > you don't want to work with it again. As such, I've changed my mind and
> > believe that we should mark the LogDir (assume it's a disk) as offline on
> > the first `IOException` encountered. This is the LogCleaner's current
> > behavior. We shouldn't change that.
> >
> > *Respawning Threads* - I believe we should never re-spawn a thread. The
> > correct approach in my mind is to either have it stay dead or never let
> it
> > die in the first place.
> >
> > *Uncleanable-partition-names metric* - Colin is right, this metric is
> > unneeded. Users can monitor the `uncleanable-partitions-count` metric and
> > inspect logs.
> >
> >
> > Hey Ray,
> >
> >> 2) I'm 100% with James in agreement with setting up the LogCleaner to
> >> skip over problematic partitions instead of dying.
> > I think we can do this for every exception that isn't `IOException`. This
> > will future-proof us against bugs in the system and potential other
> errors.
> > Protecting yourself against unexpected failures is always a good thing in
> > my mind, but I also think that protecting yourself against bugs in the
> > software is sort of clunky. What does everybody think about this?
> >
> >> 4) The only improvement I can think of is that if such an
> >> error occurs, then have the option (configuration setting?) to create a
> >> <log_segment>.skip file (or something similar).
> > This is a good suggestion. Have others also seen corruption be generally
> > tied to the same segment?
> >
> > On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io>
> wrote:
> >
> >> For the cleaner thread specifically, I do not think respawning will
> help at
> >> all because we are more than likely to run into the same issue again
> which
> >> would end up crashing the cleaner. Retrying makes sense for transient
> >> errors or when you believe some part of the system could have healed
> >> itself, both of which I think are not true for the log cleaner.
> >>
> >> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com>
> wrote:
> >>
> >>> <<<respawning threads is likely to make things worse, by putting you in
> >> an
> >>> infinite loop which consumes resources and fires off continuous log
> >>> messages.
> >>> Hi Colin.  In case it could be relevant, one way to mitigate this
> effect
> >> is
> >>> to implement a backoff mechanism (if a second respawn is to occur then
> >> wait
> >>> for 1 minute before doing it; then if a third respawn is to occur wait
> >> for
> >>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some
> max
> >>> wait time).
> >>>
> >>> I have no opinion on whether respawn is appropriate or not in this
> >> context,
> >>> but a mitigation like the increasing backoff described above may be
> >>> relevant in weighing the pros and cons.
> >>>
> >>> Ron
> >>>
> >>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org>
> wrote:
> >>>
> >>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>
> >>>>> I agree that it would be good if the LogCleaner were more tolerant of
> >>>>> errors. Currently, as you said, once it dies, it stays dead.
> >>>>>
> >>>>> Things are better now than they used to be. We have the metric
> >>>>>        kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>> which we can use to tell us if the threads are dead. And as of 1.1.0,
> >>> we
> >>>>> have KIP-226, which allows you to restart the log cleaner thread,
> >>>>> without requiring a broker restart.
> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>
> >>>>> I've only read about this, I haven't personally tried it.
> >>>> Thanks for pointing this out, James!  Stanislav, we should probably
> >> add a
> >>>> sentence or two mentioning the KIP-226 changes somewhere in the KIP.
> >>> Maybe
> >>>> in the intro section?
> >>>>
> >>>> I think it's clear that requiring the users to manually restart the
> log
> >>>> cleaner is not a very good solution.  But it's good to know that it's
> a
> >>>> possibility on some older releases.
> >>>>
> >>>>> Some comments:
> >>>>> * I like the idea of having the log cleaner continue to clean as many
> >>>>> partitions as it can, skipping over the problematic ones if possible.
> >>>>>
> >>>>> * If the log cleaner thread dies, I think it should automatically be
> >>>>> revived. Your KIP attempts to do that by catching exceptions during
> >>>>> execution, but I think we should go all the way and make sure that a
> >>> new
> >>>>> one gets created, if the thread ever dies.
> >>>> This is inconsistent with the way the rest of Kafka works.  We don't
> >>>> automatically re-create other threads in the broker if they terminate.
> >>> In
> >>>> general, if there is a serious bug in the code, respawning threads is
> >>>> likely to make things worse, by putting you in an infinite loop which
> >>>> consumes resources and fires off continuous log messages.
> >>>>
> >>>>> * It might be worth trying to re-clean the uncleanable partitions.
> >> I've
> >>>>> seen cases where an uncleanable partition later became cleanable. I
> >>>>> unfortunately don't remember how that happened, but I remember being
> >>>>> surprised when I discovered it. It might have been something like a
> >>>>> follower was uncleanable but after a leader election happened, the
> >> log
> >>>>> truncated and it was then cleanable again. I'm not sure.
> >>>> James, I disagree.  We had this behavior in the Hadoop Distributed
> File
> >>>> System (HDFS) and it was a constant source of user problems.
> >>>>
> >>>> What would happen is disks would just go bad over time.  The DataNode
> >>>> would notice this and take them offline.  But then, due to some
> >>>> "optimistic" code, the DataNode would periodically try to re-add them
> >> to
> >>>> the system.  Then one of two things would happen: the disk would just
> >>> fail
> >>>> immediately again, or it would appear to work and then fail after a
> >> short
> >>>> amount of time.
> >>>>
> >>>> The way the disk failed was normally having an I/O request take a
> >> really
> >>>> long time and time out.  So a bunch of request handler threads would
> >>>> basically slam into a brick wall when they tried to access the bad
> >> disk,
> >>>> slowing the DataNode to a crawl.  It was even worse in the second
> >>> scenario,
> >>>> if the disk appeared to work for a while, but then failed.  Any data
> >> that
> >>>> had been written on that DataNode to that disk would be lost, and we
> >>> would
> >>>> need to re-replicate it.
> >>>>
> >>>> Disks aren't biological systems-- they don't heal over time.  Once
> >>> they're
> >>>> bad, they stay bad.  The log cleaner needs to be robust against cases
> >>> where
> >>>> the disk really is failing, and really is returning bad data or timing
> >>> out.
> >>>>> * For your metrics, can you spell out the full metric in JMX-style
> >>>>> format, such as:
> >>>>>
> >>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >>>>>                value=4
> >>>>>
> >>>>> * For "uncleanable-partitions": topic-partition names can be very
> >> long.
> >>>>> I think the current max size is 210 characters (or maybe 240-ish?).
> >>>>> Having the "uncleanable-partitions" being a list could be very large
> >>>>> metric. Also, having the metric come out as a csv might be difficult
> >> to
> >>>>> work with for monitoring systems. If we *did* want the topic names to
> >>> be
> >>>>> accessible, what do you think of having the
> >>>>>        kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>> I'm not sure if LogCleanerManager is the right type, but my example
> >> was
> >>>>> that the topic and partition can be tags in the metric. That will
> >> allow
> >>>>> monitoring systems to more easily slice and dice the metric. I'm not
> >>>>> sure what the attribute for that metric would be. Maybe something
> >> like
> >>>>> "uncleaned bytes" for that topic-partition? Or time-since-last-clean?
> >>> Or
> >>>>> maybe even just "Value=1".
> >>>> I haven't thought about this that hard, but do we really need the
> >>>> uncleanable topic names to be accessible through a metric?  It seems
> >> like
> >>>> the admin should notice that uncleanable partitions are present, and
> >> then
> >>>> check the logs?
> >>>>
> >>>>> * About `max.uncleanable.partitions`, you said that this likely
> >>>>> indicates that the disk is having problems. I'm not sure that is the
> >>>>> case. For the 4 JIRAs that you mentioned about log cleaner problems,
> >>> all
> >>>>> of them are partition-level scenarios that happened during normal
> >>>>> operation. None of them were indicative of disk problems.
> >>>> I don't think this is a meaningful comparison.  In general, we don't
> >>>> accept JIRAs for hard disk problems that happen on a particular
> >> cluster.
> >>>> If someone opened a JIRA that said "my hard disk is having problems"
> we
> >>>> could close that as "not a Kafka bug."  This doesn't prove that disk
> >>>> problems don't happen, but just that JIRA isn't the right place for
> >>> them.
> >>>> I do agree that the log cleaner has had a significant number of logic
> >>>> bugs, and that we need to be careful to limit their impact.  That's
> one
> >>>> reason why I think that a threshold of "number of uncleanable logs" is
> >> a
> >>>> good idea, rather than just failing after one IOException.  In all the
> >>>> cases I've seen where a user hit a logic bug in the log cleaner, it
> was
> >>>> just one partition that had the issue.  We also should increase test
> >>>> coverage for the log cleaner.
> >>>>
> >>>>> * About marking disks as offline when exceeding a certain threshold,
> >>>>> that actually increases the blast radius of log compaction failures.
> >>>>> Currently, the uncleaned partitions are still readable and writable.
> >>>>> Taking the disks offline would impact availability of the uncleanable
> >>>>> partitions, as well as impact all other partitions that are on the
> >>> disk.
> >>>> In general, when we encounter I/O errors, we take the disk partition
> >>>> offline.  This is spelled out in KIP-112 (
> >>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> >>>> ) :
> >>>>
> >>>>> - Broker assumes a log directory to be good after it starts, and mark
> >>>> log directory as
> >>>>> bad once there is IOException when broker attempts to access (i.e.
> >> read
> >>>> or write) the log directory.
> >>>>> - Broker will be offline if all log directories are bad.
> >>>>> - Broker will stop serving replicas in any bad log directory. New
> >>>> replicas will only be created
> >>>>> on good log directory.
> >>>> The behavior Stanislav is proposing for the log cleaner is actually
> >> more
> >>>> optimistic than what we do for regular broker I/O, since we will
> >> tolerate
> >>>> multiple IOExceptions, not just one.  But it's generally consistent.
> >>>> Ignoring errors is not.  In any case, if you want to tolerate an
> >>> unlimited
> >>>> number of I/O errors, you can just set the threshold to an infinite
> >> value
> >>>> (although I think that would be a bad idea).
> >>>>
> >>>> best,
> >>>> Colin
> >>>>
> >>>>> -James
> >>>>>
> >>>>>
> >>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
> >>>> stanislav@confluent.io> wrote:
> >>>>>> I renamed the KIP and that changed the link. Sorry about that. Here
> >>> is
> >>>> the
> >>>>>> new link:
> >>>>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
> >>>> stanislav@confluent.io>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hey group,
> >>>>>>>
> >>>>>>> I created a new KIP about making log compaction more
> >> fault-tolerant.
> >>>>>>> Please give it a look here and please share what you think,
> >>>> especially in
> >>>>>>> regards to the points in the "Needs Discussion" paragraph.
> >>>>>>>
> >>>>>>> KIP: KIP-346
> >>>>>>> <
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> >>>>>>> --
> >>>>>>> Best,
> >>>>>>> Stanislav
> >>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best,
> >>>>>> Stanislav
> >
>
>

-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ray Chiang <rc...@apache.org>.
Thanks.  Can you fix the link in the "KIPs under discussion" table on 
the main KIP landing page 
<https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?  
I tried, but the Wiki won't let me.

-Ray

On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> Hey guys,
>
> @Colin - good point. I added some sentences mentioning recent improvements
> in the introductory section.
>
> *Disk Failure* - I tend to agree with what Colin said - once a disk fails,
> you don't want to work with it again. As such, I've changed my mind and
> believe that we should mark the LogDir (assume it's a disk) as offline on
> the first `IOException` encountered. This is the LogCleaner's current
> behavior. We shouldn't change that.
>
> *Respawning Threads* - I believe we should never re-spawn a thread. The
> correct approach in my mind is to either have it stay dead or never let it
> die in the first place.
>
> *Uncleanable-partition-names metric* - Colin is right, this metric is
> unneeded. Users can monitor the `uncleanable-partitions-count` metric and
> inspect logs.
>
>
> Hey Ray,
>
>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
>> skip over problematic partitions instead of dying.
> I think we can do this for every exception that isn't `IOException`. This
> will future-proof us against bugs in the system and potential other errors.
> Protecting yourself against unexpected failures is always a good thing in
> my mind, but I also think that protecting yourself against bugs in the
> software is sort of clunky. What does everybody think about this?
>
>> 4) The only improvement I can think of is that if such an
>> error occurs, then have the option (configuration setting?) to create a
>> <log_segment>.skip file (or something similar).
> This is a good suggestion. Have others also seen corruption be generally
> tied to the same segment?
>
> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io> wrote:
>
>> For the cleaner thread specifically, I do not think respawning will help at
>> all because we are more than likely to run into the same issue again which
>> would end up crashing the cleaner. Retrying makes sense for transient
>> errors or when you believe some part of the system could have healed
>> itself, both of which I think are not true for the log cleaner.
>>
>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com> wrote:
>>
>>> <<<respawning threads is likely to make things worse, by putting you in
>> an
>>> infinite loop which consumes resources and fires off continuous log
>>> messages.
>>> Hi Colin.  In case it could be relevant, one way to mitigate this effect
>> is
>>> to implement a backoff mechanism (if a second respawn is to occur then
>> wait
>>> for 1 minute before doing it; then if a third respawn is to occur wait
>> for
>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max
>>> wait time).
>>>
>>> I have no opinion on whether respawn is appropriate or not in this
>> context,
>>> but a mitigation like the increasing backoff described above may be
>>> relevant in weighing the pros and cons.
>>>
>>> Ron
>>>
>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org> wrote:
>>>
>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>
>>>>> I agree that it would be good if the LogCleaner were more tolerant of
>>>>> errors. Currently, as you said, once it dies, it stays dead.
>>>>>
>>>>> Things are better now than they used to be. We have the metric
>>>>>        kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>>> which we can use to tell us if the threads are dead. And as of 1.1.0,
>>> we
>>>>> have KIP-226, which allows you to restart the log cleaner thread,
>>>>> without requiring a broker restart.
>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>
>>>>> I've only read about this, I haven't personally tried it.
>>>> Thanks for pointing this out, James!  Stanislav, we should probably
>> add a
>>>> sentence or two mentioning the KIP-226 changes somewhere in the KIP.
>>> Maybe
>>>> in the intro section?
>>>>
>>>> I think it's clear that requiring the users to manually restart the log
>>>> cleaner is not a very good solution.  But it's good to know that it's a
>>>> possibility on some older releases.
>>>>
>>>>> Some comments:
>>>>> * I like the idea of having the log cleaner continue to clean as many
>>>>> partitions as it can, skipping over the problematic ones if possible.
>>>>>
>>>>> * If the log cleaner thread dies, I think it should automatically be
>>>>> revived. Your KIP attempts to do that by catching exceptions during
>>>>> execution, but I think we should go all the way and make sure that a
>>> new
>>>>> one gets created, if the thread ever dies.
>>>> This is inconsistent with the way the rest of Kafka works.  We don't
>>>> automatically re-create other threads in the broker if they terminate.
>>> In
>>>> general, if there is a serious bug in the code, respawning threads is
>>>> likely to make things worse, by putting you in an infinite loop which
>>>> consumes resources and fires off continuous log messages.
>>>>
>>>>> * It might be worth trying to re-clean the uncleanable partitions.
>> I've
>>>>> seen cases where an uncleanable partition later became cleanable. I
>>>>> unfortunately don't remember how that happened, but I remember being
>>>>> surprised when I discovered it. It might have been something like a
>>>>> follower was uncleanable but after a leader election happened, the
>> log
>>>>> truncated and it was then cleanable again. I'm not sure.
>>>> James, I disagree.  We had this behavior in the Hadoop Distributed File
>>>> System (HDFS) and it was a constant source of user problems.
>>>>
>>>> What would happen is disks would just go bad over time.  The DataNode
>>>> would notice this and take them offline.  But then, due to some
>>>> "optimistic" code, the DataNode would periodically try to re-add them
>> to
>>>> the system.  Then one of two things would happen: the disk would just
>>> fail
>>>> immediately again, or it would appear to work and then fail after a
>> short
>>>> amount of time.
>>>>
>>>> The way the disk failed was normally having an I/O request take a
>> really
>>>> long time and time out.  So a bunch of request handler threads would
>>>> basically slam into a brick wall when they tried to access the bad
>> disk,
>>>> slowing the DataNode to a crawl.  It was even worse in the second
>>> scenario,
>>>> if the disk appeared to work for a while, but then failed.  Any data
>> that
>>>> had been written on that DataNode to that disk would be lost, and we
>>> would
>>>> need to re-replicate it.
>>>>
>>>> Disks aren't biological systems-- they don't heal over time.  Once
>>> they're
>>>> bad, they stay bad.  The log cleaner needs to be robust against cases
>>> where
>>>> the disk really is failing, and really is returning bad data or timing
>>> out.
>>>>> * For your metrics, can you spell out the full metric in JMX-style
>>>>> format, such as:
>>>>>
>>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>>>>>                value=4
>>>>>
>>>>> * For "uncleanable-partitions": topic-partition names can be very
>> long.
>>>>> I think the current max size is 210 characters (or maybe 240-ish?).
>>>>> Having the "uncleanable-partitions" being a list could be very large
>>>>> metric. Also, having the metric come out as a csv might be difficult
>> to
>>>>> work with for monitoring systems. If we *did* want the topic names to
>>> be
>>>>> accessible, what do you think of having the
>>>>>        kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>>> I'm not sure if LogCleanerManager is the right type, but my example
>> was
>>>>> that the topic and partition can be tags in the metric. That will
>> allow
>>>>> monitoring systems to more easily slice and dice the metric. I'm not
>>>>> sure what the attribute for that metric would be. Maybe something
>> like
>>>>> "uncleaned bytes" for that topic-partition? Or time-since-last-clean?
>>> Or
>>>>> maybe even just "Value=1".
>>>> I haven't thought about this that hard, but do we really need the
>>>> uncleanable topic names to be accessible through a metric?  It seems
>> like
>>>> the admin should notice that uncleanable partitions are present, and
>> then
>>>> check the logs?
>>>>
>>>>> * About `max.uncleanable.partitions`, you said that this likely
>>>>> indicates that the disk is having problems. I'm not sure that is the
>>>>> case. For the 4 JIRAs that you mentioned about log cleaner problems,
>>> all
>>>>> of them are partition-level scenarios that happened during normal
>>>>> operation. None of them were indicative of disk problems.
>>>> I don't think this is a meaningful comparison.  In general, we don't
>>>> accept JIRAs for hard disk problems that happen on a particular
>> cluster.
>>>> If someone opened a JIRA that said "my hard disk is having problems" we
>>>> could close that as "not a Kafka bug."  This doesn't prove that disk
>>>> problems don't happen, but just that JIRA isn't the right place for
>>> them.
>>>> I do agree that the log cleaner has had a significant number of logic
>>>> bugs, and that we need to be careful to limit their impact.  That's one
>>>> reason why I think that a threshold of "number of uncleanable logs" is
>> a
>>>> good idea, rather than just failing after one IOException.  In all the
>>>> cases I've seen where a user hit a logic bug in the log cleaner, it was
>>>> just one partition that had the issue.  We also should increase test
>>>> coverage for the log cleaner.
>>>>
>>>>> * About marking disks as offline when exceeding a certain threshold,
>>>>> that actually increases the blast radius of log compaction failures.
>>>>> Currently, the uncleaned partitions are still readable and writable.
>>>>> Taking the disks offline would impact availability of the uncleanable
>>>>> partitions, as well as impact all other partitions that are on the
>>> disk.
>>>> In general, when we encounter I/O errors, we take the disk partition
>>>> offline.  This is spelled out in KIP-112 (
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>>>> ) :
>>>>
>>>>> - Broker assumes a log directory to be good after it starts, and mark
>>>> log directory as
>>>>> bad once there is IOException when broker attempts to access (i.e.
>> read
>>>> or write) the log directory.
>>>>> - Broker will be offline if all log directories are bad.
>>>>> - Broker will stop serving replicas in any bad log directory. New
>>>> replicas will only be created
>>>>> on good log directory.
>>>> The behavior Stanislav is proposing for the log cleaner is actually
>> more
>>>> optimistic than what we do for regular broker I/O, since we will
>> tolerate
>>>> multiple IOExceptions, not just one.  But it's generally consistent.
>>>> Ignoring errors is not.  In any case, if you want to tolerate an
>>> unlimited
>>>> number of I/O errors, you can just set the threshold to an infinite
>> value
>>>> (although I think that would be a bad idea).
>>>>
>>>> best,
>>>> Colin
>>>>
>>>>> -James
>>>>>
>>>>>
>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
>>>> stanislav@confluent.io> wrote:
>>>>>> I renamed the KIP and that changed the link. Sorry about that. Here
>>> is
>>>> the
>>>>>> new link:
>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
>>>> stanislav@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey group,
>>>>>>>
>>>>>>> I created a new KIP about making log compaction more
>> fault-tolerant.
>>>>>>> Please give it a look here and please share what you think,
>>>> especially in
>>>>>>> regards to the points in the "Needs Discussion" paragraph.
>>>>>>>
>>>>>>> KIP: KIP-346
>>>>>>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>>>>>>> --
>>>>>>> Best,
>>>>>>> Stanislav
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best,
>>>>>> Stanislav
>


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey guys,

@Colin - good point. I added some sentences mentioning recent improvements
in the introductory section.

*Disk Failure* - I tend to agree with what Colin said - once a disk fails,
you don't want to work with it again. As such, I've changed my mind and
believe that we should mark the LogDir (assuming it's a disk) as offline on
the first `IOException` encountered. This is the LogCleaner's current
behavior. We shouldn't change that.

*Respawning Threads* - I believe we should never re-spawn a thread. The
correct approach in my mind is to either have it stay dead or never let it
die in the first place.

*Uncleanable-partition-names metric* - Colin is right, this metric is
unneeded. Users can monitor the `uncleanable-partitions-count` metric and
inspect logs.
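To make the counting semantics concrete, here is a toy sketch of what the remaining count-only metric would track. This is plain Python for illustration, not the broker's actual Yammer/JMX gauge registration; the class and method names are assumptions, not anything in the codebase:

```python
# Toy model of the proposed uncleanable-partitions-count value. In the broker
# this would surface as a JMX gauge under kafka.log:type=LogCleanerManager;
# here a set makes the de-duplication semantics visible.
class UncleanablePartitions:
    def __init__(self):
        self._partitions = set()

    def mark(self, topic_partition):
        # Marking the same partition twice must not inflate the count.
        self._partitions.add(topic_partition)

    def count(self):
        # The single value a monitoring system would scrape and alert on.
        return len(self._partitions)
```

An operator would alert when the count rises above zero and then read the broker logs to find the actual partition names.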


Hey Ray,

> 2) I'm 100% with James in agreement with setting up the LogCleaner to
> skip over problematic partitions instead of dying.
I think we can do this for every exception that isn't `IOException`. This
will future-proof us against bugs in the system and other potential errors.
Protecting yourself against unexpected failures is always a good thing in
my mind, but I also think that protecting yourself against bugs in the
software is sort of clunky. What does everybody think about this?
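A minimal sketch of that split, assuming hypothetical clean/mark callbacks (none of these are real Kafka APIs): an IOException fails the whole log directory, as it does today, while any other exception quarantines only the one partition and keeps the cleaner thread alive:

```python
import logging

logger = logging.getLogger("log-cleaner")

def clean_all(partitions, clean, mark_uncleanable, mark_dir_offline):
    """Illustrative cleaner pass; all four arguments are hypothetical."""
    for tp in partitions:
        try:
            clean(tp)
        except IOError:
            # Disk-level problem: fail the log directory, matching the
            # KIP-112 behavior discussed elsewhere in this thread.
            mark_dir_offline(tp)
            raise
        except Exception:
            # Likely a logic bug confined to this partition: quarantine it
            # and keep cleaning the rest.
            logger.exception("marking %s uncleanable", tp)
            mark_uncleanable(tp)
```

The point of the sketch is only the two catch blocks: unexpected non-I/O failures stop one partition's compaction rather than the whole thread.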

> 4) The only improvement I can think of is that if such an
> error occurs, then have the option (configuration setting?) to create a
> <log_segment>.skip file (or something similar).
This is a good suggestion. Have others also seen corruption generally
tied to the same segment?

On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dh...@confluent.io> wrote:

> For the cleaner thread specifically, I do not think respawning will help at
> all because we are more than likely to run into the same issue again which
> would end up crashing the cleaner. Retrying makes sense for transient
> errors or when you believe some part of the system could have healed
> itself, both of which I think are not true for the log cleaner.
>
> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com> wrote:
>
> > <<<respawning threads is likely to make things worse, by putting you in
> an
> > infinite loop which consumes resources and fires off continuous log
> > messages.
> > Hi Colin.  In case it could be relevant, one way to mitigate this effect
> is
> > to implement a backoff mechanism (if a second respawn is to occur then
> wait
> > for 1 minute before doing it; then if a third respawn is to occur wait
> for
> > 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max
> > wait time).
> >
> > I have no opinion on whether respawn is appropriate or not in this
> context,
> > but a mitigation like the increasing backoff described above may be
> > relevant in weighing the pros and cons.
> >
> > Ron
> >
> > On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org> wrote:
> >
> > > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > > > Hi Stanislav! Thanks for this KIP!
> > > >
> > > > I agree that it would be good if the LogCleaner were more tolerant of
> > > > errors. Currently, as you said, once it dies, it stays dead.
> > > >
> > > > Things are better now than they used to be. We have the metric
> > > >       kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > > > which we can use to tell us if the threads are dead. And as of 1.1.0,
> > we
> > > > have KIP-226, which allows you to restart the log cleaner thread,
> > > > without requiring a broker restart.
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > >
> > >
> > > > I've only read about this, I haven't personally tried it.
> > >
> > > Thanks for pointing this out, James!  Stanislav, we should probably
> add a
> > > sentence or two mentioning the KIP-226 changes somewhere in the KIP.
> > Maybe
> > > in the intro section?
> > >
> > > I think it's clear that requiring the users to manually restart the log
> > > cleaner is not a very good solution.  But it's good to know that it's a
> > > possibility on some older releases.
> > >
> > > >
> > > > Some comments:
> > > > * I like the idea of having the log cleaner continue to clean as many
> > > > partitions as it can, skipping over the problematic ones if possible.
> > > >
> > > > * If the log cleaner thread dies, I think it should automatically be
> > > > revived. Your KIP attempts to do that by catching exceptions during
> > > > execution, but I think we should go all the way and make sure that a
> > new
> > > > one gets created, if the thread ever dies.
> > >
> > > This is inconsistent with the way the rest of Kafka works.  We don't
> > > automatically re-create other threads in the broker if they terminate.
> > In
> > > general, if there is a serious bug in the code, respawning threads is
> > > likely to make things worse, by putting you in an infinite loop which
> > > consumes resources and fires off continuous log messages.
> > >
> > > >
> > > > * It might be worth trying to re-clean the uncleanable partitions.
> I've
> > > > seen cases where an uncleanable partition later became cleanable. I
> > > > unfortunately don't remember how that happened, but I remember being
> > > > surprised when I discovered it. It might have been something like a
> > > > follower was uncleanable but after a leader election happened, the
> log
> > > > truncated and it was then cleanable again. I'm not sure.
> > >
> > > James, I disagree.  We had this behavior in the Hadoop Distributed File
> > > System (HDFS) and it was a constant source of user problems.
> > >
> > > What would happen is disks would just go bad over time.  The DataNode
> > > would notice this and take them offline.  But then, due to some
> > > "optimistic" code, the DataNode would periodically try to re-add them
> to
> > > the system.  Then one of two things would happen: the disk would just
> > fail
> > > immediately again, or it would appear to work and then fail after a
> short
> > > amount of time.
> > >
> > > The way the disk failed was normally having an I/O request take a
> really
> > > long time and time out.  So a bunch of request handler threads would
> > > basically slam into a brick wall when they tried to access the bad
> disk,
> > > slowing the DataNode to a crawl.  It was even worse in the second
> > scenario,
> > > if the disk appeared to work for a while, but then failed.  Any data
> that
> > > had been written on that DataNode to that disk would be lost, and we
> > would
> > > need to re-replicate it.
> > >
> > > Disks aren't biological systems-- they don't heal over time.  Once
> > they're
> > > bad, they stay bad.  The log cleaner needs to be robust against cases
> > where
> > > the disk really is failing, and really is returning bad data or timing
> > out.
> > >
> > > >
> > > > * For your metrics, can you spell out the full metric in JMX-style
> > > > format, such as:
> > > >
> >  kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> > > >               value=4
> > > >
> > > > * For "uncleanable-partitions": topic-partition names can be very
> long.
> > > > I think the current max size is 210 characters (or maybe 240-ish?).
> > > > Having the "uncleanable-partitions" being a list could make for a very large
> > > > metric. Also, having the metric come out as a csv might be difficult
> to
> > > > work with for monitoring systems. If we *did* want the topic names to
> > be
> > > > accessible, what do you think of having the
> > > >       kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > > > I'm not sure if LogCleanerManager is the right type, but my example
> was
> > > > that the topic and partition can be tags in the metric. That will
> allow
> > > > monitoring systems to more easily slice and dice the metric. I'm not
> > > > sure what the attribute for that metric would be. Maybe something
> like
> > > > "uncleaned bytes" for that topic-partition? Or time-since-last-clean?
> > Or
> > > > maybe even just "Value=1".
> > >
> > > I haven't thought about this that hard, but do we really need the
> > > uncleanable topic names to be accessible through a metric?  It seems
> like
> > > the admin should notice that uncleanable partitions are present, and
> then
> > > check the logs?
> > >
> > > >
> > > > * About `max.uncleanable.partitions`, you said that this likely
> > > > indicates that the disk is having problems. I'm not sure that is the
> > > > case. For the 4 JIRAs that you mentioned about log cleaner problems,
> > all
> > > > of them are partition-level scenarios that happened during normal
> > > > operation. None of them were indicative of disk problems.
> > >
> > > I don't think this is a meaningful comparison.  In general, we don't
> > > accept JIRAs for hard disk problems that happen on a particular
> cluster.
> > > If someone opened a JIRA that said "my hard disk is having problems" we
> > > could close that as "not a Kafka bug."  This doesn't prove that disk
> > > problems don't happen, but  just that JIRA isn't the right place for
> > them.
> > >
> > > I do agree that the log cleaner has had a significant number of logic
> > > bugs, and that we need to be careful to limit their impact.  That's one
> > > reason why I think that a threshold of "number of uncleanable logs" is
> a
> > > good idea, rather than just failing after one IOException.  In all the
> > > cases I've seen where a user hit a logic bug in the log cleaner, it was
> > > just one partition that had the issue.  We also should increase test
> > > coverage for the log cleaner.
> > >
> > > > * About marking disks as offline when exceeding a certain threshold,
> > > > that actually increases the blast radius of log compaction failures.
> > > > Currently, the uncleaned partitions are still readable and writable.
> > > > Taking the disks offline would impact availability of the uncleanable
> > > > partitions, as well as impact all other partitions that are on the
> > disk.
> > >
> > > In general, when we encounter I/O errors, we take the disk partition
> > > offline.  This is spelled out in KIP-112 (
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> > > ) :
> > >
> > > > - Broker assumes a log directory to be good after it starts, and mark
> > > log directory as
> > > > bad once there is IOException when broker attempts to access (i.e.
> read
> > > or write) the log directory.
> > > > - Broker will be offline if all log directories are bad.
> > > > - Broker will stop serving replicas in any bad log directory. New
> > > replicas will only be created
> > > > on good log directory.
> > >
> > > The behavior Stanislav is proposing for the log cleaner is actually
> more
> > > optimistic than what we do for regular broker I/O, since we will
> tolerate
> > > multiple IOExceptions, not just one.  But it's generally consistent.
> > > Ignoring errors is not.  In any case, if you want to tolerate an
> > unlimited
> > > number of I/O errors, you can just set the threshold to an infinite
> value
> > > (although I think that would be a bad idea).
> > >
> > > best,
> > > Colin
> > >
> > > >
> > > > -James
> > > >
> > > >
> > > > > On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
> > > stanislav@confluent.io> wrote:
> > > > >
> > > > > I renamed the KIP and that changed the link. Sorry about that. Here
> > is
> > > the
> > > > > new link:
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> > > > >
> > > > > On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
> > > stanislav@confluent.io>
> > > > > wrote:
> > > > >
> > > > >> Hey group,
> > > > >>
> > > > >> I created a new KIP about making log compaction more
> fault-tolerant.
> > > > >> Please give it a look here and please share what you think,
> > > especially in
> > > > >> regards to the points in the "Needs Discussion" paragraph.
> > > > >>
> > > > >> KIP: KIP-346
> > > > >> <
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> > > >
> > > > >> --
> > > > >> Best,
> > > > >> Stanislav
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > Best,
> > > > > Stanislav
> > > >
> > >
> >
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ray Chiang <rc...@apache.org>.
Thanks for creating this KIP Stanislav.  My observations:

1) I agree with Colin that automatically re-launching threads 
generally isn't a great idea.  Metrics and/or monitoring threads are 
generally much safer.  And there's always the question of what happens if 
the re-launcher itself dies.

2) I'm 100% with James in agreement with setting up the LogCleaner to 
skip over problematic partitions instead of dying.

3) There's a lot of "feature bloat" suggestions.  From how I see things, 
a message could get corrupted in one of several states:

3a) Message is corrupted by the leader partition saving to disk. 
Replicas have the same error.
3b) Message is corrupted by one of the replica partitions saving to 
disk.  Leader and other replica(s) are unlikely to have the same error.
3c) Disk corruption happens later (e.g. during partition move)

If we have the simplest solution, then all of the above will not cause 
the LogCleaner to crash and 3b/3c have a chance of manual recovery.

4) In most of the issues I'm seeing via work, most of the corruption 
seems persistent on the same log segment (i.e. a 3b/3c type of 
corruption).  The only improvement I can think of is that if such an 
error occurs, then have the option (configuration setting?) to create a 
<log_segment>.skip file (or something similar).  If the .skip file is 
there, don't re-scan the segment.  If you want a re-try or manage to fix 
the issue manually (e.g. copying from a replica), then the .skip file 
can be deleted after the segment is fixed and the LogCleaner will try 
again on the next iteration.
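A sketch of the marker idea above, assuming a plain file placed next to the segment. The ".skip" suffix and the segment-path layout are illustrative only; nothing like this exists in Kafka today:

```python
import os

def should_clean(segment_path):
    # Skip segments that a previous failed cleaning pass has flagged.
    return not os.path.exists(segment_path + ".skip")

def mark_skipped(segment_path):
    # Written when cleaning this segment fails. Deleting the marker after
    # repairing the segment (e.g. by copying from a replica) re-enables
    # cleaning on the LogCleaner's next iteration.
    open(segment_path + ".skip", "a").close()
```

The attraction of a file-based marker is that the skip state survives restarts and can be cleared by an operator with a plain `rm`, with no broker configuration change.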

5) I'm in alignment with Colin's comment about hard drive failures. By 
the time you can reliably detect HDD hardware failures, it's less about 
improving the LogCleaner as much as that data needs to be moved to a new 
drive.

-Ray

On 7/25/18 11:55 AM, Dhruvil Shah wrote:
> For the cleaner thread specifically, I do not think respawning will help at
> all because we are more than likely to run into the same issue again which
> would end up crashing the cleaner. Retrying makes sense for transient
> errors or when you believe some part of the system could have healed
> itself, both of which I think are not true for the log cleaner.
>
> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com> wrote:
>
>> <<<respawning threads is likely to make things worse, by putting you in an
>> infinite loop which consumes resources and fires off continuous log
>> messages.
>> Hi Colin.  In case it could be relevant, one way to mitigate this effect is
>> to implement a backoff mechanism (if a second respawn is to occur then wait
>> for 1 minute before doing it; then if a third respawn is to occur wait for
>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max
>> wait time).
>>
>> I have no opinion on whether respawn is appropriate or not in this context,
>> but a mitigation like the increasing backoff described above may be
>> relevant in weighing the pros and cons.
>>
>> Ron
>>
>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org> wrote:
>>
>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>> Hi Stanislav! Thanks for this KIP!
>>>>
>>>> I agree that it would be good if the LogCleaner were more tolerant of
>>>> errors. Currently, as you said, once it dies, it stays dead.
>>>>
>>>> Things are better now than they used to be. We have the metric
>>>>        kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>> which we can use to tell us if the threads are dead. And as of 1.1.0,
>> we
>>>> have KIP-226, which allows you to restart the log cleaner thread,
>>>> without requiring a broker restart.
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>
>>>> I've only read about this, I haven't personally tried it.
>>> Thanks for pointing this out, James!  Stanislav, we should probably add a
>>> sentence or two mentioning the KIP-226 changes somewhere in the KIP.
>> Maybe
>>> in the intro section?
>>>
>>> I think it's clear that requiring the users to manually restart the log
>>> cleaner is not a very good solution.  But it's good to know that it's a
>>> possibility on some older releases.
>>>
>>>> Some comments:
>>>> * I like the idea of having the log cleaner continue to clean as many
>>>> partitions as it can, skipping over the problematic ones if possible.
>>>>
>>>> * If the log cleaner thread dies, I think it should automatically be
>>>> revived. Your KIP attempts to do that by catching exceptions during
>>>> execution, but I think we should go all the way and make sure that a
>> new
>>>> one gets created, if the thread ever dies.
>>> This is inconsistent with the way the rest of Kafka works.  We don't
>>> automatically re-create other threads in the broker if they terminate.
>> In
>>> general, if there is a serious bug in the code, respawning threads is
>>> likely to make things worse, by putting you in an infinite loop which
>>> consumes resources and fires off continuous log messages.
>>>
>>>> * It might be worth trying to re-clean the uncleanable partitions. I've
>>>> seen cases where an uncleanable partition later became cleanable. I
>>>> unfortunately don't remember how that happened, but I remember being
>>>> surprised when I discovered it. It might have been something like a
>>>> follower was uncleanable but after a leader election happened, the log
>>>> truncated and it was then cleanable again. I'm not sure.
>>> James, I disagree.  We had this behavior in the Hadoop Distributed File
>>> System (HDFS) and it was a constant source of user problems.
>>>
>>> What would happen is disks would just go bad over time.  The DataNode
>>> would notice this and take them offline.  But then, due to some
>>> "optimistic" code, the DataNode would periodically try to re-add them to
>>> the system.  Then one of two things would happen: the disk would just
>> fail
>>> immediately again, or it would appear to work and then fail after a short
>>> amount of time.
>>>
>>> The way the disk failed was normally having an I/O request take a really
>>> long time and time out.  So a bunch of request handler threads would
>>> basically slam into a brick wall when they tried to access the bad disk,
>>> slowing the DataNode to a crawl.  It was even worse in the second
>> scenario,
>>> if the disk appeared to work for a while, but then failed.  Any data that
>>> had been written on that DataNode to that disk would be lost, and we
>> would
>>> need to re-replicate it.
>>>
>>> Disks aren't biological systems-- they don't heal over time.  Once
>> they're
>>> bad, they stay bad.  The log cleaner needs to be robust against cases
>> where
>>> the disk really is failing, and really is returning bad data or timing
>> out.
>>>> * For your metrics, can you spell out the full metric in JMX-style
>>>> format, such as:
>>>>
>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>>>>                value=4
>>>>
>>>> * For "uncleanable-partitions": topic-partition names can be very long.
>>>> I think the current max size is 210 characters (or maybe 240-ish?).
>>>> Having the "uncleanable-partitions" being a list could make for a very large
>>>> metric. Also, having the metric come out as a csv might be difficult to
>>>> work with for monitoring systems. If we *did* want the topic names to
>> be
>>>> accessible, what do you think of having the
>>>>        kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>> I'm not sure if LogCleanerManager is the right type, but my example was
>>>> that the topic and partition can be tags in the metric. That will allow
>>>> monitoring systems to more easily slice and dice the metric. I'm not
>>>> sure what the attribute for that metric would be. Maybe something like
>>>> "uncleaned bytes" for that topic-partition? Or time-since-last-clean?
>> Or
>>>> maybe even just "Value=1".
>>> I haven't thought about this that hard, but do we really need the
>>> uncleanable topic names to be accessible through a metric?  It seems like
>>> the admin should notice that uncleanable partitions are present, and then
>>> check the logs?
>>>
>>>> * About `max.uncleanable.partitions`, you said that this likely
>>>> indicates that the disk is having problems. I'm not sure that is the
>>>> case. For the 4 JIRAs that you mentioned about log cleaner problems,
>> all
>>>> of them are partition-level scenarios that happened during normal
>>>> operation. None of them were indicative of disk problems.
>>> I don't think this is a meaningful comparison.  In general, we don't
>>> accept JIRAs for hard disk problems that happen on a particular cluster.
>>> If someone opened a JIRA that said "my hard disk is having problems" we
>>> could close that as "not a Kafka bug."  This doesn't prove that disk
>>> problems don't happen, but  just that JIRA isn't the right place for
>> them.
>>> I do agree that the log cleaner has had a significant number of logic
>>> bugs, and that we need to be careful to limit their impact.  That's one
>>> reason why I think that a threshold of "number of uncleanable logs" is a
>>> good idea, rather than just failing after one IOException.  In all the
>>> cases I've seen where a user hit a logic bug in the log cleaner, it was
>>> just one partition that had the issue.  We also should increase test
>>> coverage for the log cleaner.
>>>
>>>> * About marking disks as offline when exceeding a certain threshold,
>>>> that actually increases the blast radius of log compaction failures.
>>>> Currently, the uncleaned partitions are still readable and writable.
>>>> Taking the disks offline would impact availability of the uncleanable
>>>> partitions, as well as impact all other partitions that are on the
>> disk.
>>> In general, when we encounter I/O errors, we take the disk partition
>>> offline.  This is spelled out in KIP-112 (
>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>>> ) :
>>>
>>>> - Broker assumes a log directory to be good after it starts, and mark
>>> log directory as
>>>> bad once there is IOException when broker attempts to access (i.e. read
>>> or write) the log directory.
>>>> - Broker will be offline if all log directories are bad.
>>>> - Broker will stop serving replicas in any bad log directory. New
>>> replicas will only be created
>>>> on good log directory.
>>> The behavior Stanislav is proposing for the log cleaner is actually more
>>> optimistic than what we do for regular broker I/O, since we will tolerate
>>> multiple IOExceptions, not just one.  But it's generally consistent.
>>> Ignoring errors is not.  In any case, if you want to tolerate an
>> unlimited
>>> number of I/O errors, you can just set the threshold to an infinite value
>>> (although I think that would be a bad idea).
>>>
>>> best,
>>> Colin
>>>
>>>> -James
>>>>
>>>>
>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
>>> stanislav@confluent.io> wrote:
>>>>> I renamed the KIP and that changed the link. Sorry about that. Here
>> is
>>> the
>>>>> new link:
>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
>>> stanislav@confluent.io>
>>>>> wrote:
>>>>>
>>>>>> Hey group,
>>>>>>
>>>>>> I created a new KIP about making log compaction more fault-tolerant.
>>>>>> Please give it a look here and please share what you think,
>>> especially in
>>>>>> regards to the points in the "Needs Discussion" paragraph.
>>>>>>
>>>>>> KIP: KIP-346
>>>>>> <
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>>>>>> --
>>>>>> Best,
>>>>>> Stanislav
>>>>>>
>>>>>
>>>>> --
>>>>> Best,
>>>>> Stanislav


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Dhruvil Shah <dh...@confluent.io>.
For the cleaner thread specifically, I do not think respawning will help at
all because we are more than likely to run into the same issue again which
would end up crashing the cleaner. Retrying makes sense for transient
errors or when you believe some part of the system could have healed
itself, both of which I think are not true for the log cleaner.

On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rn...@gmail.com> wrote:

> <<<respawning threads is likely to make things worse, by putting you in an
> infinite loop which consumes resources and fires off continuous log
> messages.
> Hi Colin.  In case it could be relevant, one way to mitigate this effect is
> to implement a backoff mechanism (if a second respawn is to occur then wait
> for 1 minute before doing it; then if a third respawn is to occur wait for
> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max
> wait time).
>
> I have no opinion on whether respawn is appropriate or not in this context,
> but a mitigation like the increasing backoff described above may be
> relevant in weighing the pros and cons.
>
> Ron
>
> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org> wrote:
>
> > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > > Hi Stanislav! Thanks for this KIP!
> > >
> > > I agree that it would be good if the LogCleaner were more tolerant of
> > > errors. Currently, as you said, once it dies, it stays dead.
> > >
> > > Things are better now than they used to be. We have the metric
> > >       kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > > which we can use to tell us if the threads are dead. And as of 1.1.0,
> we
> > > have KIP-226, which allows you to restart the log cleaner thread,
> > > without requiring a broker restart.
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >
> >
> > > I've only read about this, I haven't personally tried it.
> >
> > Thanks for pointing this out, James!  Stanislav, we should probably add a
> > sentence or two mentioning the KIP-226 changes somewhere in the KIP.
> Maybe
> > in the intro section?
> >
> > I think it's clear that requiring the users to manually restart the log
> > cleaner is not a very good solution.  But it's good to know that it's a
> > possibility on some older releases.
> >
> > >
> > > Some comments:
> > > * I like the idea of having the log cleaner continue to clean as many
> > > partitions as it can, skipping over the problematic ones if possible.
> > >
> > > * If the log cleaner thread dies, I think it should automatically be
> > > revived. Your KIP attempts to do that by catching exceptions during
> > > execution, but I think we should go all the way and make sure that a
> new
> > > one gets created, if the thread ever dies.
> >
> > This is inconsistent with the way the rest of Kafka works.  We don't
> > automatically re-create other threads in the broker if they terminate.
> In
> > general, if there is a serious bug in the code, respawning threads is
> > likely to make things worse, by putting you in an infinite loop which
> > consumes resources and fires off continuous log messages.
> >
> > >
> > > * It might be worth trying to re-clean the uncleanable partitions. I've
> > > seen cases where an uncleanable partition later became cleanable. I
> > > unfortunately don't remember how that happened, but I remember being
> > > surprised when I discovered it. It might have been something like a
> > > follower was uncleanable but after a leader election happened, the log
> > > truncated and it was then cleanable again. I'm not sure.
> >
> > James, I disagree.  We had this behavior in the Hadoop Distributed File
> > System (HDFS) and it was a constant source of user problems.
> >
> > What would happen is disks would just go bad over time.  The DataNode
> > would notice this and take them offline.  But then, due to some
> > "optimistic" code, the DataNode would periodically try to re-add them to
> > the system.  Then one of two things would happen: the disk would just
> fail
> > immediately again, or it would appear to work and then fail after a short
> > amount of time.
> >
> > The way the disk failed was normally having an I/O request take a really
> > long time and time out.  So a bunch of request handler threads would
> > basically slam into a brick wall when they tried to access the bad disk,
> > slowing the DataNode to a crawl.  It was even worse in the second
> scenario,
> > if the disk appeared to work for a while, but then failed.  Any data that
> > had been written on that DataNode to that disk would be lost, and we
> would
> > need to re-replicate it.
> >
> > Disks aren't biological systems-- they don't heal over time.  Once
> they're
> > bad, they stay bad.  The log cleaner needs to be robust against cases
> where
> > the disk really is failing, and really is returning bad data or timing
> out.
> >
> > >
> > > * For your metrics, can you spell out the full metric in JMX-style
> > > format, such as:
> > >
>  kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> > >               value=4
> > >
> > > * For "uncleanable-partitions": topic-partition names can be very long.
> > > I think the current max size is 210 characters (or maybe 240-ish?).
> > > Having the "uncleanable-partitions" being a list could be very large
> > > metric. Also, having the metric come out as a csv might be difficult to
> > > work with for monitoring systems. If we *did* want the topic names to
> be
> > > accessible, what do you think of having the
> > >       kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > > I'm not sure if LogCleanerManager is the right type, but my example was
> > > that the topic and partition can be tags in the metric. That will allow
> > > monitoring systems to more easily slice and dice the metric. I'm not
> > > sure what the attribute for that metric would be. Maybe something like
> > > "uncleaned bytes" for that topic-partition? Or time-since-last-clean?
> Or
> > > maybe even just "Value=1".
> >
> > I haven't thought about this that hard, but do we really need the
> > uncleanable topic names to be accessible through a metric?  It seems like
> > the admin should notice that uncleanable partitions are present, and then
> > check the logs?
> >
> > >
> > > * About `max.uncleanable.partitions`, you said that this likely
> > > indicates that the disk is having problems. I'm not sure that is the
> > > case. For the 4 JIRAs that you mentioned about log cleaner problems,
> all
> > > of them are partition-level scenarios that happened during normal
> > > operation. None of them were indicative of disk problems.
> >
> > I don't think this is a meaningful comparison.  In general, we don't
> > accept JIRAs for hard disk problems that happen on a particular cluster.
> > If someone opened a JIRA that said "my hard disk is having problems" we
> > could close that as "not a Kafka bug."  This doesn't prove that disk
> > problems don't happen, but just that JIRA isn't the right place for
> them.
> >
> > I do agree that the log cleaner has had a significant number of logic
> > bugs, and that we need to be careful to limit their impact.  That's one
> > reason why I think that a threshold of "number of uncleanable logs" is a
> > good idea, rather than just failing after one IOException.  In all the
> > cases I've seen where a user hit a logic bug in the log cleaner, it was
> > just one partition that had the issue.  We also should increase test
> > coverage for the log cleaner.
> >
> > > * About marking disks as offline when exceeding a certain threshold,
> > > that actually increases the blast radius of log compaction failures.
> > > Currently, the uncleaned partitions are still readable and writable.
> > > Taking the disks offline would impact availability of the uncleanable
> > > partitions, as well as impact all other partitions that are on the
> disk.
> >
> > In general, when we encounter I/O errors, we take the disk partition
> > offline.  This is spelled out in KIP-112 (
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> > ) :
> >
> > > - Broker assumes a log directory to be good after it starts, and mark
> > log directory as
> > > bad once there is IOException when broker attempts to access (i.e. read
> > or write) the log directory.
> > > - Broker will be offline if all log directories are bad.
> > > - Broker will stop serving replicas in any bad log directory. New
> > replicas will only be created
> > > on good log directory.
> >
> > The behavior Stanislav is proposing for the log cleaner is actually more
> > optimistic than what we do for regular broker I/O, since we will tolerate
> > multiple IOExceptions, not just one.  But it's generally consistent.
> > Ignoring errors is not.  In any case, if you want to tolerate an
> unlimited
> > number of I/O errors, you can just set the threshold to an infinite value
> > (although I think that would be a bad idea).
> >
> > best,
> > Colin
> >
> > >
> > > -James
> > >
> > >
> > > > On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
> > stanislav@confluent.io> wrote:
> > > >
> > > > I renamed the KIP and that changed the link. Sorry about that. Here
> is
> > the
> > > > new link:
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> > > >
> > > > On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
> > stanislav@confluent.io>
> > > > wrote:
> > > >
> > > >> Hey group,
> > > >>
> > > >> I created a new KIP about making log compaction more fault-tolerant.
> > > >> Please give it a look here and please share what you think,
> > especially in
> > > >> regards to the points in the "Needs Discussion" paragraph.
> > > >>
> > > >> KIP: KIP-346
> > > >> <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> > >
> > > >> --
> > > >> Best,
> > > >> Stanislav
> > > >>
> > > >
> > > >
> > > > --
> > > > Best,
> > > > Stanislav
> > >
> >
>

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Ron Dagostino <rn...@gmail.com>.
<<<respawning threads is likely to make things worse, by putting you in an
infinite loop which consumes resources and fires off continuous log
messages.
Hi Colin.  In case it could be relevant, one way to mitigate this effect is
to implement a backoff mechanism (if a second respawn is to occur then wait
for 1 minute before doing it; then if a third respawn is to occur wait for
2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max
wait time).
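For concreteness, the capped doubling schedule described above could be computed like this (a sketch only; the one-minute base and the cap come from the example, not from any proposed default):

```java
public class RespawnBackoff {
    /**
     * Wait (in minutes) before the n-th respawn attempt, n >= 1:
     * 1, 2, 4, 8, ... doubling each time, capped at maxWaitMinutes.
     */
    static long backoffMinutes(int attempt, long maxWaitMinutes) {
        long wait = 1L << Math.min(attempt - 1, 62); // clamp the shift to avoid overflow
        return Math.min(wait, maxWaitMinutes);
    }
}
```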

I have no opinion on whether respawn is appropriate or not in this context,
but a mitigation like the increasing backoff described above may be
relevant in weighing the pros and cons.

Ron

On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cm...@apache.org> wrote:

> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > Hi Stanislav! Thanks for this KIP!
> >
> > I agree that it would be good if the LogCleaner were more tolerant of
> > errors. Currently, as you said, once it dies, it stays dead.
> >
> > Things are better now than they used to be. We have the metric
> >       kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > which we can use to tell us if the threads are dead. And as of 1.1.0, we
> > have KIP-226, which allows you to restart the log cleaner thread,
> > without requiring a broker restart.
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration>
>
> > I've only read about this, I haven't personally tried it.
>
> Thanks for pointing this out, James!  Stanislav, we should probably add a
> sentence or two mentioning the KIP-226 changes somewhere in the KIP.  Maybe
> in the intro section?
>
> I think it's clear that requiring the users to manually restart the log
> cleaner is not a very good solution.  But it's good to know that it's a
> possibility on some older releases.
>
> >
> > Some comments:
> > * I like the idea of having the log cleaner continue to clean as many
> > partitions as it can, skipping over the problematic ones if possible.
> >
> > * If the log cleaner thread dies, I think it should automatically be
> > revived. Your KIP attempts to do that by catching exceptions during
> > execution, but I think we should go all the way and make sure that a new
> > one gets created, if the thread ever dies.
>
> This is inconsistent with the way the rest of Kafka works.  We don't
> automatically re-create other threads in the broker if they terminate.  In
> general, if there is a serious bug in the code, respawning threads is
> likely to make things worse, by putting you in an infinite loop which
> consumes resources and fires off continuous log messages.
>
> >
> > * It might be worth trying to re-clean the uncleanable partitions. I've
> > seen cases where an uncleanable partition later became cleanable. I
> > unfortunately don't remember how that happened, but I remember being
> > surprised when I discovered it. It might have been something like a
> > follower was uncleanable but after a leader election happened, the log
> > truncated and it was then cleanable again. I'm not sure.
>
> James, I disagree.  We had this behavior in the Hadoop Distributed File
> System (HDFS) and it was a constant source of user problems.
>
> What would happen is disks would just go bad over time.  The DataNode
> would notice this and take them offline.  But then, due to some
> "optimistic" code, the DataNode would periodically try to re-add them to
> the system.  Then one of two things would happen: the disk would just fail
> immediately again, or it would appear to work and then fail after a short
> amount of time.
>
> The way the disk failed was normally having an I/O request take a really
> long time and time out.  So a bunch of request handler threads would
> basically slam into a brick wall when they tried to access the bad disk,
> slowing the DataNode to a crawl.  It was even worse in the second scenario,
> if the disk appeared to work for a while, but then failed.  Any data that
> had been written on that DataNode to that disk would be lost, and we would
> need to re-replicate it.
>
> Disks aren't biological systems-- they don't heal over time.  Once they're
> bad, they stay bad.  The log cleaner needs to be robust against cases where
> the disk really is failing, and really is returning bad data or timing out.
>
> >
> > * For your metrics, can you spell out the full metric in JMX-style
> > format, such as:
> >       kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >               value=4
> >
> > * For "uncleanable-partitions": topic-partition names can be very long.
> > I think the current max size is 210 characters (or maybe 240-ish?).
> > Having the "uncleanable-partitions" being a list could be very large
> > metric. Also, having the metric come out as a csv might be difficult to
> > work with for monitoring systems. If we *did* want the topic names to be
> > accessible, what do you think of having the
> >       kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > I'm not sure if LogCleanerManager is the right type, but my example was
> > that the topic and partition can be tags in the metric. That will allow
> > monitoring systems to more easily slice and dice the metric. I'm not
> > sure what the attribute for that metric would be. Maybe something like
> > "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or
> > maybe even just "Value=1".
>
> I haven't thought about this that hard, but do we really need the
> uncleanable topic names to be accessible through a metric?  It seems like
> the admin should notice that uncleanable partitions are present, and then
> check the logs?
>
> >
> > * About `max.uncleanable.partitions`, you said that this likely
> > indicates that the disk is having problems. I'm not sure that is the
> > case. For the 4 JIRAs that you mentioned about log cleaner problems, all
> > of them are partition-level scenarios that happened during normal
> > operation. None of them were indicative of disk problems.
>
> I don't think this is a meaningful comparison.  In general, we don't
> accept JIRAs for hard disk problems that happen on a particular cluster.
> If someone opened a JIRA that said "my hard disk is having problems" we
> could close that as "not a Kafka bug."  This doesn't prove that disk
> problems don't happen, but just that JIRA isn't the right place for them.
>
> I do agree that the log cleaner has had a significant number of logic
> bugs, and that we need to be careful to limit their impact.  That's one
> reason why I think that a threshold of "number of uncleanable logs" is a
> good idea, rather than just failing after one IOException.  In all the
> cases I've seen where a user hit a logic bug in the log cleaner, it was
> just one partition that had the issue.  We also should increase test
> coverage for the log cleaner.
>
> > * About marking disks as offline when exceeding a certain threshold,
> > that actually increases the blast radius of log compaction failures.
> > Currently, the uncleaned partitions are still readable and writable.
> > Taking the disks offline would impact availability of the uncleanable
> > partitions, as well as impact all other partitions that are on the disk.
>
> In general, when we encounter I/O errors, we take the disk partition
> offline.  This is spelled out in KIP-112 (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> ) :
>
> > - Broker assumes a log directory to be good after it starts, and mark
> log directory as
> > bad once there is IOException when broker attempts to access (i.e. read
> or write) the log directory.
> > - Broker will be offline if all log directories are bad.
> > - Broker will stop serving replicas in any bad log directory. New
> replicas will only be created
> > on good log directory.
>
> The behavior Stanislav is proposing for the log cleaner is actually more
> optimistic than what we do for regular broker I/O, since we will tolerate
> multiple IOExceptions, not just one.  But it's generally consistent.
> Ignoring errors is not.  In any case, if you want to tolerate an unlimited
> number of I/O errors, you can just set the threshold to an infinite value
> (although I think that would be a bad idea).
>
> best,
> Colin
>
> >
> > -James
> >
> >
> > > On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <
> stanislav@confluent.io> wrote:
> > >
> > > I renamed the KIP and that changed the link. Sorry about that. Here is
> the
> > > new link:
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> > >
> > > On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <
> stanislav@confluent.io>
> > > wrote:
> > >
> > >> Hey group,
> > >>
> > >> I created a new KIP about making log compaction more fault-tolerant.
> > >> Please give it a look here and please share what you think,
> especially in
> > >> regards to the points in the "Needs Discussion" paragraph.
> > >>
> > >> KIP: KIP-346
> > >> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> >
> > >> --
> > >> Best,
> > >> Stanislav
> > >>
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> >
>

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Colin McCabe <cm...@apache.org>.
On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> Hi Stanislav! Thanks for this KIP!
> 
> I agree that it would be good if the LogCleaner were more tolerant of 
> errors. Currently, as you said, once it dies, it stays dead. 
> 
> Things are better now than they used to be. We have the metric
> 	kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> which we can use to tell us if the threads are dead. And as of 1.1.0, we 
> have KIP-226, which allows you to restart the log cleaner thread, 
> without requiring a broker restart. 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> I've only read about this, I haven't personally tried it.

Thanks for pointing this out, James!  Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP.  Maybe in the intro section?

I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution.  But it's good to know that it's a possibility on some older releases.

> 
> Some comments:
> * I like the idea of having the log cleaner continue to clean as many 
> partitions as it can, skipping over the problematic ones if possible.
> 
> * If the log cleaner thread dies, I think it should automatically be 
> revived. Your KIP attempts to do that by catching exceptions during 
> execution, but I think we should go all the way and make sure that a new 
> one gets created, if the thread ever dies.

This is inconsistent with the way the rest of Kafka works.  We don't automatically re-create other threads in the broker if they terminate.  In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.

> 
> * It might be worth trying to re-clean the uncleanable partitions. I've 
> seen cases where an uncleanable partition later became cleanable. I 
> unfortunately don't remember how that happened, but I remember being 
> surprised when I discovered it. It might have been something like a 
> follower was uncleanable but after a leader election happened, the log 
> truncated and it was then cleanable again. I'm not sure.

James, I disagree.  We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.

What would happen is disks would just go bad over time.  The DataNode would notice this and take them offline.  But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system.  Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.

The way the disk failed was normally having an I/O request take a really long time and time out.  So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl.  It was even worse in the second scenario, if the disk appeared to work for a while, but then failed.  Any data that had been written on that DataNode to that disk would be lost, and we would need to re-replicate it.

Disks aren't biological systems-- they don't heal over time.  Once they're bad, they stay bad.  The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.

> 
> * For your metrics, can you spell out the full metric in JMX-style 
> format, such as:
> 	kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> 		value=4
> 
> * For "uncleanable-partitions": topic-partition names can be very long. 
> I think the current max size is 210 characters (or maybe 240-ish?). 
> Having the "uncleanable-partitions" being a list could be very large 
> metric. Also, having the metric come out as a csv might be difficult to 
> work with for monitoring systems. If we *did* want the topic names to be 
> accessible, what do you think of having the 
> 	kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> I'm not sure if LogCleanerManager is the right type, but my example was 
> that the topic and partition can be tags in the metric. That will allow 
> monitoring systems to more easily slice and dice the metric. I'm not 
> sure what the attribute for that metric would be. Maybe something like  
> "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or 
> maybe even just "Value=1".

I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric?  It seems like the admin should notice that uncleanable partitions are present, and then check the logs?

> 
> * About `max.uncleanable.partitions`, you said that this likely 
> indicates that the disk is having problems. I'm not sure that is the 
> case. For the 4 JIRAs that you mentioned about log cleaner problems, all 
> of them are partition-level scenarios that happened during normal 
> operation. None of them were indicative of disk problems.

I don't think this is a meaningful comparison.  In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster.  If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug."  This doesn't prove that disk problems don't happen, but just that JIRA isn't the right place for them.

I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact.  That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException.  In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue.  We also should increase test coverage for the log cleaner.
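As a sketch of what such a per-log-directory threshold might look like (the config name follows the KIP's proposal; the data structure and the offline decision here are assumptions, not the actual implementation):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Tracks uncleanable partitions per log directory. Cleaning skips failed
// partitions until their count in one directory exceeds
// max.uncleanable.partitions, at which point the directory is marked offline.
public class UncleanableTracker {
    private final Map<String, Set<String>> uncleanableByDir = new HashMap<>();
    private final int maxUncleanablePartitions;

    UncleanableTracker(int maxUncleanablePartitions) {
        this.maxUncleanablePartitions = maxUncleanablePartitions;
    }

    /** Records a cleaning failure; returns true if the directory should go offline. */
    boolean recordFailure(String logDir, String topicPartition) {
        Set<String> failed =
                uncleanableByDir.computeIfAbsent(logDir, d -> new HashSet<>());
        failed.add(topicPartition); // a repeated failure on the same partition counts once
        return failed.size() > maxUncleanablePartitions;
    }
}
```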

> * About marking disks as offline when exceeding a certain threshold, 
> that actually increases the blast radius of log compaction failures. 
> Currently, the uncleaned partitions are still readable and writable. 
> Taking the disks offline would impact availability of the uncleanable 
> partitions, as well as impact all other partitions that are on the disk.

In general, when we encounter I/O errors, we take the disk partition offline.  This is spelled out in KIP-112 ( https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD ) :

> - Broker assumes a log directory to be good after it starts, and mark log directory as
> bad once there is IOException when broker attempts to access (i.e. read or write) the log directory.
> - Broker will be offline if all log directories are bad.
> - Broker will stop serving replicas in any bad log directory. New replicas will only be created 
> on good log directory.

The behavior Stanislav is proposing for the log cleaner is actually more optimistic than what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just one.  But it's generally consistent.  Ignoring errors is not.  In any case, if you want to tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite value (although I think that would be a bad idea).

best,
Colin

> 
> -James
> 
> 
> > On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <st...@confluent.io> wrote:
> > 
> > I renamed the KIP and that changed the link. Sorry about that. Here is the
> > new link:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> > 
> > On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <st...@confluent.io>
> > wrote:
> > 
> >> Hey group,
> >> 
> >> I created a new KIP about making log compaction more fault-tolerant.
> >> Please give it a look here and please share what you think, especially in
> >> regards to the points in the "Needs Discussion" paragraph.
> >> 
> >> KIP: KIP-346
> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
> >> --
> >> Best,
> >> Stanislav
> >> 
> > 
> > 
> > -- 
> > Best,
> > Stanislav
> 

Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by James Cheng <wu...@gmail.com>.
Hi Stanislav! Thanks for this KIP!

I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead. 

Things are better now than they used to be. We have the metric
	kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread, without requiring a broker restart. https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration I've only read about this, I haven't personally tried it.

Some comments:
* I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.

* If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created, if the thread ever dies.

* It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.

* For your metrics, can you spell out the full metric in JMX-style format, such as:
	kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
		value=4

* For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having the "uncleanable-partitions" being a list could be very large metric. Also, having the metric come out as a csv might be difficult to work with for monitoring systems. If we *did* want the topic names to be accessible, what do you think of having the 
	kafka.log:type=LogCleanerManager,topic=topic1,partition=2
I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like  "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
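To make the tagged shape concrete, here is how such a per-partition name could be constructed with plain JMX (illustrative only; Kafka registers its metrics through Yammer Metrics, and neither the type nor the attribute name here is final):

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class UncleanablePartitionMetricName {
    // Builds a per-partition metric name with topic and partition as JMX
    // key properties ("tags"), so monitoring systems can filter on either.
    static ObjectName metricName(String topic, int partition)
            throws MalformedObjectNameException {
        return new ObjectName(String.format(
                "kafka.log:type=LogCleanerManager,name=uncleanable-partition,topic=%s,partition=%d",
                topic, partition));
    }
}
```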

* About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.

* About marking disks as offline when exceeding a certain threshold, that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.

-James


> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <st...@confluent.io> wrote:
> 
> I renamed the KIP and that changed the link. Sorry about that. Here is the
> new link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> 
> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <st...@confluent.io>
> wrote:
> 
>> Hey group,
>> 
>> I created a new KIP about making log compaction more fault-tolerant.
>> Please give it a look here and please share what you think, especially in
>> regards to the points in the "Needs Discussion" paragraph.
>> 
>> KIP: KIP-346
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>> --
>> Best,
>> Stanislav
>> 
> 
> 
> -- 
> Best,
> Stanislav


Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure

Posted by Stanislav Kozlovski <st...@confluent.io>.
I renamed the KIP and that changed the link. Sorry about that. Here is the
new link:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error

On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <st...@confluent.io>
wrote:

> Hey group,
>
> I created a new KIP about making log compaction more fault-tolerant.
> Please give it a look here and please share what you think, especially in
> regards to the points in the "Needs Discussion" paragraph.
>
> KIP: KIP-346
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
> --
> Best,
> Stanislav
>


-- 
Best,
Stanislav