You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by Jason Gustafson <ja...@confluent.io> on 2018/06/26 22:23:25 UTC

Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Hey Dong,

Sorry for being slow to catch up to this.

I think the benefit of the sanity check seems a little dubious in the first
place. We detect garbage at the end of the index file, but that's about it.
Is there any reason to think that corruption is more likely to occur there
or any other reason to think this check is still beneficial for flushed
data? I assume we did the check because we presumed it was cheap, but
perhaps the cost is adding up as the number of partitions grows. How much
does startup time improve if we skip the sanity check for data earlier than
the recovery point? Does the lazy loading itself give some additional
benefit beyond skipping the sanity check? As Jay mentions above, the sanity
checks seem strictly speaking optional. We don't bother checking the
segments themselves for example.

Thanks,
Jason




It probably still makes sense for segments beyond the recovery point

On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <li...@gmail.com> wrote:

> Hey Jay,
>
> Yeah our existing sanity check only read the last entry in the index files.
> I must have miscommunicated if I previously said it was reading the full
> index. Broker appears to be spending a lot of time just to read the last
> entry of index files for every log segment. This is probably because OS
> will load a chunk of data that is much larger than the entry itself from
> disk to page cache. This KIP tries to make this part of operation lazy. I
> guess you are suggesting that we should just make the lazy loading the
> default behavior?
>
> Yes we currently require manual intervention if the log file is corrupted,
> i.e. if two messages with the same offset are appended to the disk
> (KAFKA-6488). The sanity check on broker startup is a bit different since
> it deals with the corruption of index files (e.g. offset index, time index
> and snapshot files) instead of the log data. In this case if index files
> are corrupted broker will automatically recover it by rebuilding the index
> files using data in the log files, without requiring manual intervention.
> Thus the design question is whether this should be done before broker can
> become leader for any partitions -- there is tradeoff between broker
> startup time and risk of delaying user requests if broker need to rebuild
> index files when it is already leader. I prefer lazy loading to reduce
> broker startup time. Not sure what are the feedback from the community on
> this issue.
>
> Thanks,
> Dong
>
>
> On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <ja...@confluent.io> wrote:
>
> > Hey Dong,
> >
> > Makes total sense. What I'm saying is I don't think that the sanity check
> > is part of any formal guarantee we provide. It is true that corruption of
> > data flushed to disk will be a potential problem, but I don't think the
> > sanity check solves that it just has a couple heuristics to help detect
> > certain possible instances of it, right? In general I think our
> assumption
> > has been that flushed data doesn't disappear or get corrupted and if it
> > does you need to manually intervene. I don't think people want to
> configure
> > things at this level so what I was suggesting was understanding why the
> > sanity check is slow and trying to avoid that rather than making it
> > configurable. I think you mentioned it was reading the full index into
> > memory. Based on the performance you describe this could be true, but it
> > definitely should not be reading anything but the last entry in the index
> > so that would be a bug. That read also happens in sanityCheck() only in
> the
> > time-based index right? In the offset index we do the same read but it
> > happens in initialization. If that read is the slow thing it might make
> > sense to try to remove it or make it lazy in both cases. If it is some
> > other part of the code then (e.g. the size check) then that may be able
> to
> > be avoided entirely (I think by the time we sanity check we already know
> > the file size from the mapping...). That was what I meant by doing some
> > data driven analysis. Maybe a quick run with hprof would help determine
> the
> > root cause of why sanityCheck is slow?
> >
> > -Jay
> >
> > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <li...@gmail.com> wrote:
> >
> > > Hey Jay,
> > >
> > > Thanks for your comments!
> > >
> > > Yeah recovery is different from the sanity check. They are correlated
> in
> > > the sense that there may still be corrupted index files even after
> clean
> > > broker shutdown. And in that case if we delay the sanity check then we
> > may
> > > delay the log recovery. The main goal of this KIP is to optimize the
> > sanity
> > > check related work so that it does not delay the broker startup much.
> > >
> > > The KIP mentioned that the sanity check is done using log recovery
> > > background thread. The name "recovery" is mentioned mainly because the
> > > background thread number is determined using the existing
> > > config num.recovery.threads.per.data.dir. I have updated the KIP to
> make
> > > this less confusing.
> > >
> > > It makes a ton of sense to optimize the broker startup time in a data
> > > driven fashion. The currently optimize is done kind of in this fashion.
> > The
> > > broker log shows that LogManager.loadLogs() takes a long time in large
> > > clusters. Then I started broker with cold cache and repeatedly get
> thread
> > > dump to see what are broker threads are doing during
> > LogManager.loadLogs().
> > > Most of the threads are working on sanityCheck() and this motivates the
> > > change in this KIP. Previously broker shutdown time was investigated
> in a
> > > similar data driven fashion and optimized with KAFKA-6172 and
> KAFKA-6175.
> > > It seems that the current KIP can reduces the rolling bounce time of a
> > > large cluster by 50% -- there may be room for further improvement but
> > maybe
> > > those do not require as big a change (with the caveat described in the
> > KIP)
> > > as suggested in this KIP.
> > >
> > > It is not clear whether it is safe to just read the latest segment
> > without
> > > sanity checking all previous inactive segment of a given partition if
> > > transaction is used. Otherwise we probably want to always skip the
> sanity
> > > check of inactive segments without introducing a new config. Maybe the
> > > developers familiar with the transaction can comment on that?
> > >
> > > Thanks,
> > > Dong
> > >
> > >
> > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <ja...@confluent.io> wrote:
> > >
> > > > Optimizing startup seems really valuable but I'm a little confused by
> > > this.
> > > >
> > > > There are two different things:
> > > > 1. Recovery
> > > > 2. Sanity check
> > > >
> > > > The terminology we're using is a bit mixed here.
> > > >
> > > > Recovery means checksumming the log segments and rebuilding the index
> > on
> > > a
> > > > hard crash. This only happens on unflushed segments, which is
> generally
> > > > just the last segment. Recovery is essential for the correctness
> > > guarantees
> > > > of the log and you shouldn't disable it. It only happens on hard
> crash
> > > and
> > > > is not a factor in graceful restart. We can likely optimize it but
> that
> > > > would make most sense to do in a data driven fashion off some
> > profiling.
> > > >
> > > > However there is also a ton of disk activity that happens during
> > > > initialization (lots of checks on the file size, absolute path,
> etc). I
> > > > think these have crept in over time with people not really realizing
> > this
> > > > code is perf sensitive and java hiding a lot of what is and isn't a
> > file
> > > > operation. One part of this is the sanityCheck() call for the two
> > > indexes.
> > > > I don't think this call reads the full index, just the last entry in
> > the
> > > > index, right?. There should be no need to read the full index except
> > > during
> > > > recovery (and then only for the segments being recovered). I think it
> > > would
> > > > make a ton of sense to optimize this but I don't think that
> > optimization
> > > > needs to be configurable as this is just a helpful sanity check to
> > detect
> > > > common non-sensical things in the index files, but it isn't part of
> the
> > > > core guarantees, in general you aren't supposed to lose committed
> data
> > > from
> > > > disk, and if you do we may be able to fail faster but we
> fundamentally
> > > > can't really help you. Again I think this would make the most sense
> to
> > do
> > > > in a data driven way, if you look at that code I think it is doing
> > crazy
> > > > amounts of file operations (e.g. getAbsolutePath, file sizes, etc). I
> > > think
> > > > it'd make most sense to profile startup with a cold cash on a large
> log
> > > > directory and do the same with an strace to see how many redundant
> > system
> > > > calls we do per segment and what is costing us and then cut some of
> > this
> > > > out. I suspect we could speed up our startup time quite a lot if we
> did
> > > > that.
> > > >
> > > > For example we have a bunch of calls like this:
> > > >
> > > >     require(len % entrySize == 0,
> > > >
> > > >             "Index file " + file.getAbsolutePath + " is corrupt,
> found
> > "
> > > +
> > > > len +
> > > >
> > > >             " bytes which is not positive or not a multiple of 8.")
> > > > I'm pretty such file.getAbsolutePath is a system call and I assume
> that
> > > > happens whether or not you fail the in-memory check?
> > > >
> > > > -Jay
> > > >
> > > >
> > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <li...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I have created KIP-263: Allow broker to skip sanity check of
> inactive
> > > > > segments on broker startup. See
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
> > > > > segments+on+broker+startup
> > > > > .
> > > > >
> > > > > This KIP provides a way to significantly reduce time to rolling
> > bounce
> > > a
> > > > > Kafka cluster.
> > > > >
> > > > > Comments are welcome!
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Posted by Patrick Huang <hz...@hotmail.com>.

Thanks, Dong!


In terms of the background retention thread reading the last entry of timeindex files, there are two scenarios we need to consider:

  1.  Most segments in each log have passed their TTL. In this case, the background thread has to read the last entry of the timeindex files for all these segments. If we skip sanity checks and skip pre-loading the indexes of the segments below the recovery point, this will cause disk spike which then increases the processing time of client requests.
  2.  Only a few segments in each log have passed their TTL. In this case, the background thread only needs read the last entry of the timeindex files for very few segments because it will stop iterating through the segments once it sees a segment with "now - segment.largestTimestamp <= config.retentionMs". This has little impact on the disk and will not affect client requests even if we skip sanity checks because the cost of reading the last entry of timeindex files is amortized across time.


Test results showed that:

  *   Rolling bounce time is reduced by 50% in both scenarios if we skip sanity checks and skip pre-loading the indexes of the segments below the recovery point.
  *   In scenario 1, we saw disk await time increased and didn't go down until the retention thread finished iterating through the segments reading the last entry of timeindex files. During this time period, there was a 5x increase in 999th processing time of produce request
  *   In scenario 2, we saw a little increase in disk await time only for a very short period of time and client requests were not affected.


Scenairo 2 is the common case because most of the time, the partition bytes in rate is steady and after broker starts up, the retention thread only needs to do the time-based retention check on the first few segments of each partition. Scenario 1 happens only when we have huge bytes in traffic and small segment size for a partition, or the broker has stopped for a long time, which are rare. Given that there will be little impact for the common scenario, I think it is safe to skip the sanity checks for segments below the recovery point to speed up broker restart.


Best,
Zhanxiang (Patrick) Huang


________________________________
From: Dong Lin <li...@gmail.com>
Sent: Saturday, July 21, 2018 3:45
To: dev
Subject: Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Here is some information related to the KIP.

Previously we thought we can ignore the sanity check by default to speedup
broker startup and there is no need for extra configuration. However based
on the code and some previous experiment result it is not clear whether it
will work. When LogManager.cleanupLogs() is called by the background
thread, the thread will go over all segments and read the last entry of
timeindex file of each segment to see whether the segment should be deleted
based on time. This may cause spike in the disk usage and reduce broker
startup time.

More test is needed to validate whether this is an issue. If this is an
issue, we have validated that the issue can be solved by using a few extra
configs. More information will be posted later when we get the result.



On Wed, Jun 27, 2018 at 10:59 AM, Dong Lin <li...@gmail.com> wrote:

> Thanks for the reply Jason and Dhruvil.
>
> Yeah we don't need config for the sanity check and thus we don't need a
> KIP. I think we are on the same page of just skipping the sanity check of
> segments before the recovery offset. I will close the KIP and submit a
> patch for this.
>
> On Wed, Jun 27, 2018 at 10:09 AM, Dhruvil Shah <dh...@confluent.io>
> wrote:
>
>> +1 to what Jason said. We need a better long-term strategy for dealing
>> with
>> corrupted log and index data, but the sanity checks we have do not
>> guarantee much in this regard.
>>
>> For now, we could do away with these index sanity checks in my opinion. We
>> could handle the missing index case at startup. I think we could have
>> missing index files only when users are upgrading from a version that did
>> not have a particular type of index to a version that does, or if the
>> operator physically deleted these files. Because these are rare scenarios,
>> having to recreate a missing index should typically not affect normal
>> startup time.
>>
>> - Dhruvil
>>
>> On Wed, Jun 27, 2018 at 8:47 AM Jason Gustafson <ja...@confluent.io>
>> wrote:
>>
>> > Hey Dong,
>> >
>> >
>> > So the main concern with the above approach is that, if for any reason
>> the
>> > > index files of inactive segment is deleted or corrupted, the broker
>> will
>> > > halt if there is only one log directory. This is different from the
>> > > existing behavior where the broker will rebuild the index for this
>> > inactive
>> > > segment before it can accept any request from consumer. Though we
>> don't
>> > > have provide guarantee for segments already flushed to disk, this
>> still
>> > > seems like a change in behavior for user. Maybe we don't have to worry
>> > > about this if we decide it is very rare, e.g. it happens only when
>> there
>> > is
>> > > disk error or when there is human error.
>> >
>> >
>> > I think we should probably still handle the case when an index file is
>> > missing during startup? But considering how weak the sanity check is, it
>> > seems fine to skip it.  Also, could we just make this change without a
>> KIP?
>> > Adding a config to enable a wimpy sanity check seems unnecessary.
>> >
>> > One scenario that does come up with users is actual segment corruption,
>> > which is only detected by consumers that are validating CRCs. To fix
>> it, we
>> > have to manually delete the segments and force re-replication. It would
>> be
>> > helpful to have a config to enable deep checking on startup for
>> particular
>> > topics or partitions. This could also just be a separate tool though
>> > ("kafka-fsck" or something).
>> >
>> > Thinking longer term, I think we need a more systematic approach to
>> dealing
>> > with corruption, not just in index files, but in the segments as well.
>> It
>> > might be nice, for example, if the consumer had a way to hint the broker
>> > that a particular offset is corrupt. The leader might then demote
>> itself,
>> > for example, and try to repair. Lots to think through though.
>> >
>> > -Jason
>> >
>> >
>> >
>> >
>> > On Wed, Jun 27, 2018 at 12:29 AM, Dong Lin <li...@gmail.com> wrote:
>> >
>> > > So the main concern with the above approach is that, if for any reason
>> > the
>> > > index files of inactive segment is deleted or corrupted, the broker
>> will
>> > > halt if there is only one log directory. This is different from the
>> > > existing behavior where the broker will rebuild the index for this
>> > inactive
>> > > segment before it can accept any request from consumer. Though we
>> don't
>> > > have provide guarantee for segments already flushed to disk, this
>> still
>> > > seems like a change in behavior for user. Maybe we don't have to worry
>> > > about this if we decide it is very rare, e.g. it happens only when
>> there
>> > is
>> > > disk error or when there is human error.
>> > >
>> > >
>> > >
>> > > On Wed, Jun 27, 2018 at 12:04 AM, Dong Lin <li...@gmail.com>
>> wrote:
>> > >
>> > > > Hey Jason,
>> > > >
>> > > > Thanks for the comment!
>> > > >
>> > > > Your comment reminded me to read through Jay's comments and my reply
>> > > > again. It seems that I probably have not captured idea of Jay's
>> comment
>> > > > that says sanity check is not part of any formal guarantee we
>> provide.
>> > I
>> > > > probably should have thought about this comment more. Let me reply
>> to
>> > > both
>> > > > yours and Jay's comment and see if I can understand you better.
>> > > >
>> > > > Here are some clarifications:
>> > > > - KIP does not intend to optimize recovery. It aims to optimize the
>> the
>> > > > sanity check when there is clean shutdown.
>> > > > - Sanity check only read the last entry of the index rather than the
>> > full
>> > > > index
>> > > > - We have already done data driven investigation though it is not
>> done
>> > > > using hprof or strace. The resulting rolling bounce time is
>> acceptable
>> > > now.
>> > > > If it appears to be an issue e.g. after more data then we may need
>> to
>> > > > revisit this with more data driven investigation
>> > > >
>> > > > I agree with the following comments:
>> > > > - We should optimize the default behavior instead of adding a new
>> > config.
>> > > > - sanity check of the segments before recovery offset is not part of
>> > any
>> > > > formal guarantee and thus we probably can just skip it.
>> > > >
>> > > > So we are all leaning towards skipping the sanity check of all
>> segments
>> > > > before the recovery offset. This solution would be pretty
>> > straightforward
>> > > > to understand and implement. And I am sure it will give us all the
>> > > benefits
>> > > > that this KIP intends to achieve. Here is only one question to
>> double
>> > > check:
>> > > >
>> > > > If consumer fetches from an inactive segment, broker will just use
>> the
>> > > > index of that inactive segment. If anything goes wrong, e.g. the
>> index
>> > > file
>> > > > is corrupted or the index file does not exist, then the broker will
>> > just
>> > > > consider it as IOException, mark the disk and the partitions on the
>> > disk
>> > > > offline and respond KafkaStorageException to consumer. Does this
>> sound
>> > > OK?
>> > > > One alternative solution is to let broker rebuild index. But this
>> > > > alternative solution is inconsistent with the idea that "sanity
>> check
>> > is
>> > > not
>> > > > part of any formal guarantee" and it may tie up all request handler
>> > > > thread for rebuilding the indexed.
>> > > >
>> > > >
>> > > > If this solution sounds right, I will update the KIP accordingly.
>> > > >
>> > > > Thanks,
>> > > > Dong
>> > > >
>> > > > On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <
>> jason@confluent.io>
>> > > > wrote:
>> > > >
>> > > >> Hey Dong,
>> > > >>
>> > > >> Sorry for being slow to catch up to this.
>> > > >>
>> > > >> I think the benefit of the sanity check seems a little dubious in
>> the
>> > > >> first
>> > > >> place. We detect garbage at the end of the index file, but that's
>> > about
>> > > >> it.
>> > > >> Is there any reason to think that corruption is more likely to
>> occur
>> > > there
>> > > >> or any other reason to think this check is still beneficial for
>> > flushed
>> > > >> data? I assume we did the check because we presumed it was cheap,
>> but
>> > > >> perhaps the cost is adding up as the number of partitions grows.
>> How
>> > > much
>> > > >> does startup time improve if we skip the sanity check for data
>> earlier
>> > > >> than
>> > > >> the recovery point? Does the lazy loading itself give some
>> additional
>> > > >> benefit beyond skipping the sanity check? As Jay mentions above,
>> the
>> > > >> sanity
>> > > >> checks seem strictly speaking optional. We don't bother checking
>> the
>> > > >> segments themselves for example.
>> > > >>
>> > > >> Thanks,
>> > > >> Jason
>> > > >>
>> > > >>
>> > > >>
>> > > >>
>> > > >> It probably still makes sense for segments beyond the recovery
>> point
>> > > >>
>> > > >> On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <li...@gmail.com>
>> > wrote:
>> > > >>
>> > > >> > Hey Jay,
>> > > >> >
>> > > >> > Yeah our existing sanity check only read the last entry in the
>> index
>> > > >> files.
>> > > >> > I must have miscommunicated if I previously said it was reading
>> the
>> > > full
>> > > >> > index. Broker appears to be spending a lot of time just to read
>> the
>> > > last
>> > > >> > entry of index files for every log segment. This is probably
>> because
>> > > OS
>> > > >> > will load a chunk of data that is much larger than the entry
>> itself
>> > > from
>> > > >> > disk to page cache. This KIP tries to make this part of operation
>> > > lazy.
>> > > >> I
>> > > >> > guess you are suggesting that we should just make the lazy
>> loading
>> > the
>> > > >> > default behavior?
>> > > >> >
>> > > >> > Yes we currently require manual intervention if the log file is
>> > > >> corrupted,
>> > > >> > i.e. if two messages with the same offset are appended to the
>> disk
>> > > >> > (KAFKA-6488). The sanity check on broker startup is a bit
>> different
>> > > >> since
>> > > >> > it deals with the corruption of index files (e.g. offset index,
>> time
>> > > >> index
>> > > >> > and snapshot files) instead of the log data. In this case if
>> index
>> > > files
>> > > >> > are corrupted broker will automatically recover it by rebuilding
>> the
>> > > >> index
>> > > >> > files using data in the log files, without requiring manual
>> > > >> intervention.
>> > > >> > Thus the design question is whether this should be done before
>> > broker
>> > > >> can
>> > > >> > become leader for any partitions -- there is tradeoff between
>> broker
>> > > >> > startup time and risk of delaying user requests if broker need to
>> > > >> rebuild
>> > > >> > index files when it is already leader. I prefer lazy loading to
>> > reduce
>> > > >> > broker startup time. Not sure what are the feedback from the
>> > community
>> > > >> on
>> > > >> > this issue.
>> > > >> >
>> > > >> > Thanks,
>> > > >> > Dong
>> > > >> >
>> > > >> >
>> > > >> > On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <ja...@confluent.io>
>> > wrote:
>> > > >> >
>> > > >> > > Hey Dong,
>> > > >> > >
>> > > >> > > Makes total sense. What I'm saying is I don't think that the
>> > sanity
>> > > >> check
>> > > >> > > is part of any formal guarantee we provide. It is true that
>> > > >> corruption of
>> > > >> > > data flushed to disk will be a potential problem, but I don't
>> > think
>> > > >> the
>> > > >> > > sanity check solves that it just has a couple heuristics to
>> help
>> > > >> detect
>> > > >> > > certain possible instances of it, right? In general I think our
>> > > >> > assumption
>> > > >> > > has been that flushed data doesn't disappear or get corrupted
>> and
>> > if
>> > > >> it
>> > > >> > > does you need to manually intervene. I don't think people want
>> to
>> > > >> > configure
>> > > >> > > things at this level so what I was suggesting was understanding
>> > why
>> > > >> the
>> > > >> > > sanity check is slow and trying to avoid that rather than
>> making
>> > it
>> > > >> > > configurable. I think you mentioned it was reading the full
>> index
>> > > into
>> > > >> > > memory. Based on the performance you describe this could be
>> true,
>> > > but
>> > > >> it
>> > > >> > > definitely should not be reading anything but the last entry in
>> > the
>> > > >> index
>> > > >> > > so that would be a bug. That read also happens in sanityCheck()
>> > only
>> > > >> in
>> > > >> > the
>> > > >> > > time-based index right? In the offset index we do the same read
>> > but
>> > > it
>> > > >> > > happens in initialization. If that read is the slow thing it
>> might
>> > > >> make
>> > > >> > > sense to try to remove it or make it lazy in both cases. If it
>> is
>> > > some
>> > > >> > > other part of the code then (e.g. the size check) then that
>> may be
>> > > >> able
>> > > >> > to
>> > > >> > > be avoided entirely (I think by the time we sanity check we
>> > already
>> > > >> know
>> > > >> > > the file size from the mapping...). That was what I meant by
>> doing
>> > > >> some
>> > > >> > > data driven analysis. Maybe a quick run with hprof would help
>> > > >> determine
>> > > >> > the
>> > > >> > > root cause of why sanityCheck is slow?
>> > > >> > >
>> > > >> > > -Jay
>> > > >> > >
>> > > >> > > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <lindong28@gmail.com
>> >
>> > > >> wrote:
>> > > >> > >
>> > > >> > > > Hey Jay,
>> > > >> > > >
>> > > >> > > > Thanks for your comments!
>> > > >> > > >
>> > > >> > > > Yeah recovery is different from the sanity check. They are
>> > > >> correlated
>> > > >> > in
>> > > >> > > > the sense that there may still be corrupted index files even
>> > after
>> > > >> > clean
>> > > >> > > > broker shutdown. And in that case if we delay the sanity
>> check
>> > > then
>> > > >> we
>> > > >> > > may
>> > > >> > > > delay the log recovery. The main goal of this KIP is to
>> optimize
>> > > the
>> > > >> > > sanity
>> > > >> > > > check related work so that it does not delay the broker
>> startup
>> > > >> much.
>> > > >> > > >
>> > > >> > > > The KIP mentioned that the sanity check is done using log
>> > recovery
>> > > >> > > > background thread. The name "recovery" is mentioned mainly
>> > because
>> > > >> the
>> > > >> > > > background thread number is determined using the existing
>> > > >> > > > config num.recovery.threads.per.data.dir. I have updated
>> the KIP
>> > > to
>> > > >> > make
>> > > >> > > > this less confusing.
>> > > >> > > >
>> > > >> > > > It makes a ton of sense to optimize the broker startup time
>> in a
>> > > >> data
>> > > >> > > > driven fashion. The currently optimize is done kind of in
>> this
>> > > >> fashion.
>> > > >> > > The
>> > > >> > > > broker log shows that LogManager.loadLogs() takes a long
>> time in
>> > > >> large
>> > > >> > > > clusters. Then I started broker with cold cache and
>> repeatedly
>> > get
>> > > >> > thread
>> > > >> > > > dump to see what are broker threads are doing during
>> > > >> > > LogManager.loadLogs().
>> > > >> > > > Most of the threads are working on sanityCheck() and this
>> > > motivates
>> > > >> the
>> > > >> > > > change in this KIP. Previously broker shutdown time was
>> > > investigated
>> > > >> > in a
>> > > >> > > > similar data driven fashion and optimized with KAFKA-6172 and
>> > > >> > KAFKA-6175.
>> > > >> > > > It seems that the current KIP can reduces the rolling bounce
>> > time
>> > > >> of a
>> > > >> > > > large cluster by 50% -- there may be room for further
>> > improvement
>> > > >> but
>> > > >> > > maybe
>> > > >> > > > those do not require as big a change (with the caveat
>> described
>> > in
>> > > >> the
>> > > >> > > KIP)
>> > > >> > > > as suggested in this KIP.
>> > > >> > > >
>> > > >> > > > It is not clear whether it is safe to just read the latest
>> > segment
>> > > >> > > without
>> > > >> > > > sanity checking all previous inactive segment of a given
>> > partition
>> > > >> if
>> > > >> > > > transaction is used. Otherwise we probably want to always
>> skip
>> > the
>> > > >> > sanity
>> > > >> > > > check of inactive segments without introducing a new config.
>> > Maybe
>> > > >> the
>> > > >> > > > developers familiar with the transaction can comment on that?
>> > > >> > > >
>> > > >> > > > Thanks,
>> > > >> > > > Dong
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <jay@confluent.io
>> >
>> > > >> wrote:
>> > > >> > > >
>> > > >> > > > > Optimizing startup seems really valuable but I'm a little
>> > > >> confused by
>> > > >> > > > this.
>> > > >> > > > >
>> > > >> > > > > There are two different things:
>> > > >> > > > > 1. Recovery
>> > > >> > > > > 2. Sanity check
>> > > >> > > > >
>> > > >> > > > > The terminology we're using is a bit mixed here.
>> > > >> > > > >
>> > > >> > > > > Recovery means checksumming the log segments and rebuilding
>> > the
>> > > >> index
>> > > >> > > on
>> > > >> > > > a
>> > > >> > > > > hard crash. This only happens on unflushed segments, which
>> is
>> > > >> > generally
>> > > >> > > > > just the last segment. Recovery is essential for the
>> > correctness
>> > > >> > > > guarantees
>> > > >> > > > > of the log and you shouldn't disable it. It only happens on
>> > hard
>> > > >> > crash
>> > > >> > > > and
>> > > >> > > > > is not a factor in graceful restart. We can likely
>> optimize it
>> > > but
>> > > >> > that
>> > > >> > > > > would make most sense to do in a data driven fashion off
>> some
>> > > >> > > profiling.
>> > > >> > > > >
>> > > >> > > > > However there is also a ton of disk activity that happens
>> > during
>> > > >> > > > > initialization (lots of checks on the file size, absolute
>> > path,
>> > > >> > etc). I
>> > > >> > > > > think these have crept in over time with people not really
>> > > >> realizing
>> > > >> > > this
>> > > >> > > > > code is perf sensitive and java hiding a lot of what is and
>> > > isn't
>> > > >> a
>> > > >> > > file
>> > > >> > > > > operation. One part of this is the sanityCheck() call for
>> the
>> > > two
>> > > >> > > > indexes.
>> > > >> > > > > I don't think this call reads the full index, just the last
>> > > entry
>> > > >> in
>> > > >> > > the
>> > > >> > > > > index, right?. There should be no need to read the full
>> index
>> > > >> except
>> > > >> > > > during
>> > > >> > > > > recovery (and then only for the segments being recovered).
>> I
>> > > >> think it
>> > > >> > > > would
>> > > >> > > > > make a ton of sense to optimize this but I don't think that
>> > > >> > > optimization
>> > > >> > > > > needs to be configurable as this is just a helpful sanity
>> > check
>> > > to
>> > > >> > > detect
>> > > >> > > > > common non-sensical things in the index files, but it isn't
>> > part
>> > > >> of
>> > > >> > the
>> > > >> > > > > core guarantees, in general you aren't supposed to lose
>> > > committed
>> > > >> > data
>> > > >> > > > from
>> > > >> > > > > disk, and if you do we may be able to fail faster but we
>> > > >> > fundamentally
>> > > >> > > > > can't really help you. Again I think this would make the
>> most
>> > > >> sense
>> > > >> > to
>> > > >> > > do
>> > > >> > > > > in a data driven way, if you look at that code I think it
>> is
>> > > doing
>> > > >> > > crazy
>> > > >> > > > > amounts of file operations (e.g. getAbsolutePath, file
>> sizes,
>> > > >> etc). I
>> > > >> > > > think
>> > > >> > > > > it'd make most sense to profile startup with a cold cash
>> on a
>> > > >> large
>> > > >> > log
>> > > >> > > > > directory and do the same with an strace to see how many
>> > > redundant
>> > > >> > > system
>> > > >> > > > > calls we do per segment and what is costing us and then cut
>> > some
>> > > >> of
>> > > >> > > this
>> > > >> > > > > out. I suspect we could speed up our startup time quite a
>> lot
>> > if
>> > > >> we
>> > > >> > did
>> > > >> > > > > that.
>> > > >> > > > >
>> > > >> > > > > For example we have a bunch of calls like this:
>> > > >> > > > >
>> > > >> > > > >     require(len % entrySize == 0,
>> > > >> > > > >
>> > > >> > > > >             "Index file " + file.getAbsolutePath + " is
>> > corrupt,
>> > > >> > found
>> > > >> > > "
>> > > >> > > > +
>> > > >> > > > > len +
>> > > >> > > > >
>> > > >> > > > >             " bytes which is not positive or not a
>> multiple of
>> > > >> 8.")
>> > > >> > > > > I'm pretty such file.getAbsolutePath is a system call and I
>> > > assume
>> > > >> > that
>> > > >> > > > > happens whether or not you fail the in-memory check?
>> > > >> > > > >
>> > > >> > > > > -Jay
>> > > >> > > > >
>> > > >> > > > >
>> > > >> > > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <
>> > lindong28@gmail.com
>> > > >
>> > > >> > > wrote:
>> > > >> > > > >
>> > > >> > > > > > Hi all,
>> > > >> > > > > >
>> > > >> > > > > > I have created KIP-263: Allow broker to skip sanity
>> check of
>> > > >> > inactive
>> > > >> > > > > > segments on broker startup. See
>> > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > >> > > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
>> > > >> > > > > > segments+on+broker+startup
>> > > >> > > > > > .
>> > > >> > > > > >
>> > > >> > > > > > This KIP provides a way to significantly reduce time to
>> > > rolling
>> > > >> > > bounce
>> > > >> > > > a
>> > > >> > > > > > Kafka cluster.
>> > > >> > > > > >
>> > > >> > > > > > Comments are welcome!
>> > > >> > > > > >
>> > > >> > > > > > Thanks,
>> > > >> > > > > > Dong
>> > > >> > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Posted by Dong Lin <li...@gmail.com>.

Here is some information related to the KIP.

Previously we thought we can ignore the sanity check by default to speedup
broker startup and there is no need for extra configuration. However based
on the code and some previous experiment result it is not clear whether it
will work. When LogManager.cleanupLogs() is called by the background
thread, the thread will go over all segments and read the last entry of
timeindex file of each segment to see whether the segment should be deleted
based on time. This may cause spike in the disk usage and reduce broker
startup time.

More test is needed to validate whether this is an issue. If this is an
issue, we have validated that the issue can be solved by using a few extra
configs. More information will be posted later when we get the result.



On Wed, Jun 27, 2018 at 10:59 AM, Dong Lin <li...@gmail.com> wrote:

> Thanks for the reply Jason and Dhruvil.
>
> Yeah we don't need config for the sanity check and thus we don't need a
> KIP. I think we are on the same page of just skipping the sanity check of
> segments before the recovery offset. I will close the KIP and submit a
> patch for this.
>
> On Wed, Jun 27, 2018 at 10:09 AM, Dhruvil Shah <dh...@confluent.io>
> wrote:
>
>> +1 to what Jason said. We need a better long-term strategy for dealing
>> with
>> corrupted log and index data, but the sanity checks we have do not
>> guarantee much in this regard.
>>
>> For now, we could do away with these index sanity checks in my opinion. We
>> could handle the missing index case at startup. I think we could have
>> missing index files only when users are upgrading from a version that did
>> not have a particular type of index to a version that does, or if the
>> operator physically deleted these files. Because these are rare scenarios,
>> having to recreate a missing index should typically not affect normal
>> startup time.
>>
>> - Dhruvil
>>
>> On Wed, Jun 27, 2018 at 8:47 AM Jason Gustafson <ja...@confluent.io>
>> wrote:
>>
>> > Hey Dong,
>> >
>> >
>> > So the main concern with the above approach is that, if for any reason
>> the
>> > > index files of inactive segment is deleted or corrupted, the broker
>> will
>> > > halt if there is only one log directory. This is different from the
>> > > existing behavior where the broker will rebuild the index for this
>> > inactive
>> > > segment before it can accept any request from consumer. Though we
>> don't
>> > > have provide guarantee for segments already flushed to disk, this
>> still
>> > > seems like a change in behavior for user. Maybe we don't have to worry
>> > > about this if we decide it is very rare, e.g. it happens only when
>> there
>> > is
>> > > disk error or when there is human error.
>> >
>> >
>> > I think we should probably still handle the case when an index file is
>> > missing during startup? But considering how weak the sanity check is, it
>> > seems fine to skip it.  Also, could we just make this change without a
>> KIP?
>> > Adding a config to enable a wimpy sanity check seems unnecessary.
>> >
>> > One scenario that does come up with users is actual segment corruption,
>> > which is only detected by consumers that are validating CRCs. To fix
>> it, we
>> > have to manually delete the segments and force re-replication. It would
>> be
>> > helpful to have a config to enable deep checking on startup for
>> particular
>> > topics or partitions. This could also just be a separate tool though
>> > ("kafka-fsck" or something).
>> >
>> > Thinking longer term, I think we need a more systematic approach to
>> dealing
>> > with corruption, not just in index files, but in the segments as well.
>> It
>> > might be nice, for example, if the consumer had a way to hint the broker
>> > that a particular offset is corrupt. The leader might then demote
>> itself,
>> > for example, and try to repair. Lots to think through though.
>> >
>> > -Jason
>> >
>> >
>> >
>> >
>> > On Wed, Jun 27, 2018 at 12:29 AM, Dong Lin <li...@gmail.com> wrote:
>> >
>> > > So the main concern with the above approach is that, if for any reason
>> > the
>> > > index files of inactive segment is deleted or corrupted, the broker
>> will
>> > > halt if there is only one log directory. This is different from the
>> > > existing behavior where the broker will rebuild the index for this
>> > inactive
>> > > segment before it can accept any request from consumer. Though we
>> don't
>> > > have provide guarantee for segments already flushed to disk, this
>> still
>> > > seems like a change in behavior for user. Maybe we don't have to worry
>> > > about this if we decide it is very rare, e.g. it happens only when
>> there
>> > is
>> > > disk error or when there is human error.
>> > >
>> > >
>> > >
>> > > On Wed, Jun 27, 2018 at 12:04 AM, Dong Lin <li...@gmail.com>
>> wrote:
>> > >
>> > > > Hey Jason,
>> > > >
>> > > > Thanks for the comment!
>> > > >
>> > > > Your comment reminded me to read through Jay's comments and my reply
>> > > > again. It seems that I probably have not captured idea of Jay's
>> comment
>> > > > that says sanity check is not part of any formal guarantee we
>> provide.
>> > I
>> > > > probably should have thought about this comment more. Let me reply
>> to
>> > > both
>> > > > yours and Jay's comment and see if I can understand you better.
>> > > >
>> > > > Here are some clarifications:
>> > > > - KIP does not intend to optimize recovery. It aims to optimize the
>> the
>> > > > sanity check when there is clean shutdown.
>> > > > - Sanity check only read the last entry of the index rather than the
>> > full
>> > > > index
>> > > > - We have already done data driven investigation though it is not
>> done
>> > > > using hprof or strace. The resulting rolling bounce time is
>> acceptable
>> > > now.
>> > > > If it appears to be an issue e.g. after more data then we may need
>> to
>> > > > revisit this with more data driven investigation
>> > > >
>> > > > I agree with the following comments:
>> > > > - We should optimize the default behavior instead of adding a new
>> > config.
>> > > > - sanity check of the segments before recovery offset is not part of
>> > any
>> > > > formal guarantee and thus we probably can just skip it.
>> > > >
>> > > > So we are all leaning towards skipping the sanity check of all
>> segments
>> > > > before the recovery offset. This solution would be pretty
>> > straightforward
>> > > > to understand and implement. And I am sure it will give us all the
>> > > benefits
>> > > > that this KIP intends to achieve. Here is only one question to
>> double
>> > > check:
>> > > >
>> > > > If consumer fetches from an inactive segment, broker will just use
>> the
>> > > > index of that inactive segment. If anything goes wrong, e.g. the
>> index
>> > > file
>> > > > is corrupted or the index file does not exist, then the broker will
>> > just
>> > > > consider it as IOException, mark the disk and the partitions on the
>> > disk
>> > > > offline and respond KafkaStorageException to consumer. Does this
>> sound
>> > > OK?
>> > > > One alternative solution is to let broker rebuild index. But this
>> > > > alternative solution is inconsistent with the idea that "sanity
>> check
>> > is
>> > > not
>> > > > part of any formal guarantee" and it may tie up all request handler
>> > > > thread for rebuilding the indexed.
>> > > >
>> > > >
>> > > > If this solution sounds right, I will update the KIP accordingly.
>> > > >
>> > > > Thanks,
>> > > > Dong
>> > > >
>> > > > On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <
>> jason@confluent.io>
>> > > > wrote:
>> > > >
>> > > >> Hey Dong,
>> > > >>
>> > > >> Sorry for being slow to catch up to this.
>> > > >>
>> > > >> I think the benefit of the sanity check seems a little dubious in
>> the
>> > > >> first
>> > > >> place. We detect garbage at the end of the index file, but that's
>> > about
>> > > >> it.
>> > > >> Is there any reason to think that corruption is more likely to
>> occur
>> > > there
>> > > >> or any other reason to think this check is still beneficial for
>> > flushed
>> > > >> data? I assume we did the check because we presumed it was cheap,
>> but
>> > > >> perhaps the cost is adding up as the number of partitions grows.
>> How
>> > > much
>> > > >> does startup time improve if we skip the sanity check for data
>> earlier
>> > > >> than
>> > > >> the recovery point? Does the lazy loading itself give some
>> additional
>> > > >> benefit beyond skipping the sanity check? As Jay mentions above,
>> the
>> > > >> sanity
>> > > >> checks seem strictly speaking optional. We don't bother checking
>> the
>> > > >> segments themselves for example.
>> > > >>
>> > > >> Thanks,
>> > > >> Jason
>> > > >>
>> > > >>
>> > > >>
>> > > >>
>> > > >> It probably still makes sense for segments beyond the recovery
>> point
>> > > >>
>> > > >> On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <li...@gmail.com>
>> > wrote:
>> > > >>
>> > > >> > Hey Jay,
>> > > >> >
>> > > >> > Yeah our existing sanity check only read the last entry in the
>> index
>> > > >> files.
>> > > >> > I must have miscommunicated if I previously said it was reading
>> the
>> > > full
>> > > >> > index. Broker appears to be spending a lot of time just to read
>> the
>> > > last
>> > > >> > entry of index files for every log segment. This is probably
>> because
>> > > OS
>> > > >> > will load a chunk of data that is much larger than the entry
>> itself
>> > > from
>> > > >> > disk to page cache. This KIP tries to make this part of operation
>> > > lazy.
>> > > >> I
>> > > >> > guess you are suggesting that we should just make the lazy
>> loading
>> > the
>> > > >> > default behavior?
>> > > >> >
>> > > >> > Yes we currently require manual intervention if the log file is
>> > > >> corrupted,
>> > > >> > i.e. if two messages with the same offset are appended to the
>> disk
>> > > >> > (KAFKA-6488). The sanity check on broker startup is a bit
>> different
>> > > >> since
>> > > >> > it deals with the corruption of index files (e.g. offset index,
>> time
>> > > >> index
>> > > >> > and snapshot files) instead of the log data. In this case if
>> index
>> > > files
>> > > >> > are corrupted broker will automatically recover it by rebuilding
>> the
>> > > >> index
>> > > >> > files using data in the log files, without requiring manual
>> > > >> intervention.
>> > > >> > Thus the design question is whether this should be done before
>> > broker
>> > > >> can
>> > > >> > become leader for any partitions -- there is tradeoff between
>> broker
>> > > >> > startup time and risk of delaying user requests if broker need to
>> > > >> rebuild
>> > > >> > index files when it is already leader. I prefer lazy loading to
>> > reduce
>> > > >> > broker startup time. Not sure what are the feedback from the
>> > community
>> > > >> on
>> > > >> > this issue.
>> > > >> >
>> > > >> > Thanks,
>> > > >> > Dong
>> > > >> >
>> > > >> >
>> > > >> > On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <ja...@confluent.io>
>> > wrote:
>> > > >> >
>> > > >> > > Hey Dong,
>> > > >> > >
>> > > >> > > Makes total sense. What I'm saying is I don't think that the
>> > sanity
>> > > >> check
>> > > >> > > is part of any formal guarantee we provide. It is true that
>> > > >> corruption of
>> > > >> > > data flushed to disk will be a potential problem, but I don't
>> > think
>> > > >> the
>> > > >> > > sanity check solves that it just has a couple heuristics to
>> help
>> > > >> detect
>> > > >> > > certain possible instances of it, right? In general I think our
>> > > >> > assumption
>> > > >> > > has been that flushed data doesn't disappear or get corrupted
>> and
>> > if
>> > > >> it
>> > > >> > > does you need to manually intervene. I don't think people want
>> to
>> > > >> > configure
>> > > >> > > things at this level so what I was suggesting was understanding
>> > why
>> > > >> the
>> > > >> > > sanity check is slow and trying to avoid that rather than
>> making
>> > it
>> > > >> > > configurable. I think you mentioned it was reading the full
>> index
>> > > into
>> > > >> > > memory. Based on the performance you describe this could be
>> true,
>> > > but
>> > > >> it
>> > > >> > > definitely should not be reading anything but the last entry in
>> > the
>> > > >> index
>> > > >> > > so that would be a bug. That read also happens in sanityCheck()
>> > only
>> > > >> in
>> > > >> > the
>> > > >> > > time-based index right? In the offset index we do the same read
>> > but
>> > > it
>> > > >> > > happens in initialization. If that read is the slow thing it
>> might
>> > > >> make
>> > > >> > > sense to try to remove it or make it lazy in both cases. If it
>> is
>> > > some
>> > > >> > > other part of the code then (e.g. the size check) then that
>> may be
>> > > >> able
>> > > >> > to
>> > > >> > > be avoided entirely (I think by the time we sanity check we
>> > already
>> > > >> know
>> > > >> > > the file size from the mapping...). That was what I meant by
>> doing
>> > > >> some
>> > > >> > > data driven analysis. Maybe a quick run with hprof would help
>> > > >> determine
>> > > >> > the
>> > > >> > > root cause of why sanityCheck is slow?
>> > > >> > >
>> > > >> > > -Jay
>> > > >> > >
>> > > >> > > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <lindong28@gmail.com
>> >
>> > > >> wrote:
>> > > >> > >
>> > > >> > > > Hey Jay,
>> > > >> > > >
>> > > >> > > > Thanks for your comments!
>> > > >> > > >
>> > > >> > > > Yeah recovery is different from the sanity check. They are
>> > > >> correlated
>> > > >> > in
>> > > >> > > > the sense that there may still be corrupted index files even
>> > after
>> > > >> > clean
>> > > >> > > > broker shutdown. And in that case if we delay the sanity
>> check
>> > > then
>> > > >> we
>> > > >> > > may
>> > > >> > > > delay the log recovery. The main goal of this KIP is to
>> optimize
>> > > the
>> > > >> > > sanity
>> > > >> > > > check related work so that it does not delay the broker
>> startup
>> > > >> much.
>> > > >> > > >
>> > > >> > > > The KIP mentioned that the sanity check is done using log
>> > recovery
>> > > >> > > > background thread. The name "recovery" is mentioned mainly
>> > because
>> > > >> the
>> > > >> > > > background thread number is determined using the existing
>> > > >> > > > config num.recovery.threads.per.data.dir. I have updated
>> the KIP
>> > > to
>> > > >> > make
>> > > >> > > > this less confusing.
>> > > >> > > >
>> > > >> > > > It makes a ton of sense to optimize the broker startup time
>> in a
>> > > >> data
>> > > >> > > > driven fashion. The currently optimize is done kind of in
>> this
>> > > >> fashion.
>> > > >> > > The
>> > > >> > > > broker log shows that LogManager.loadLogs() takes a long
>> time in
>> > > >> large
>> > > >> > > > clusters. Then I started broker with cold cache and
>> repeatedly
>> > get
>> > > >> > thread
>> > > >> > > > dump to see what are broker threads are doing during
>> > > >> > > LogManager.loadLogs().
>> > > >> > > > Most of the threads are working on sanityCheck() and this
>> > > motivates
>> > > >> the
>> > > >> > > > change in this KIP. Previously broker shutdown time was
>> > > investigated
>> > > >> > in a
>> > > >> > > > similar data driven fashion and optimized with KAFKA-6172 and
>> > > >> > KAFKA-6175.
>> > > >> > > > It seems that the current KIP can reduces the rolling bounce
>> > time
>> > > >> of a
>> > > >> > > > large cluster by 50% -- there may be room for further
>> > improvement
>> > > >> but
>> > > >> > > maybe
>> > > >> > > > those do not require as big a change (with the caveat
>> described
>> > in
>> > > >> the
>> > > >> > > KIP)
>> > > >> > > > as suggested in this KIP.
>> > > >> > > >
>> > > >> > > > It is not clear whether it is safe to just read the latest
>> > segment
>> > > >> > > without
>> > > >> > > > sanity checking all previous inactive segment of a given
>> > partition
>> > > >> if
>> > > >> > > > transaction is used. Otherwise we probably want to always
>> skip
>> > the
>> > > >> > sanity
>> > > >> > > > check of inactive segments without introducing a new config.
>> > Maybe
>> > > >> the
>> > > >> > > > developers familiar with the transaction can comment on that?
>> > > >> > > >
>> > > >> > > > Thanks,
>> > > >> > > > Dong
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <jay@confluent.io
>> >
>> > > >> wrote:
>> > > >> > > >
>> > > >> > > > > Optimizing startup seems really valuable but I'm a little
>> > > >> confused by
>> > > >> > > > this.
>> > > >> > > > >
>> > > >> > > > > There are two different things:
>> > > >> > > > > 1. Recovery
>> > > >> > > > > 2. Sanity check
>> > > >> > > > >
>> > > >> > > > > The terminology we're using is a bit mixed here.
>> > > >> > > > >
>> > > >> > > > > Recovery means checksumming the log segments and rebuilding
>> > the
>> > > >> index
>> > > >> > > on
>> > > >> > > > a
>> > > >> > > > > hard crash. This only happens on unflushed segments, which
>> is
>> > > >> > generally
>> > > >> > > > > just the last segment. Recovery is essential for the
>> > correctness
>> > > >> > > > guarantees
>> > > >> > > > > of the log and you shouldn't disable it. It only happens on
>> > hard
>> > > >> > crash
>> > > >> > > > and
>> > > >> > > > > is not a factor in graceful restart. We can likely
>> optimize it
>> > > but
>> > > >> > that
>> > > >> > > > > would make most sense to do in a data driven fashion off
>> some
>> > > >> > > profiling.
>> > > >> > > > >
>> > > >> > > > > However there is also a ton of disk activity that happens
>> > during
>> > > >> > > > > initialization (lots of checks on the file size, absolute
>> > path,
>> > > >> > etc). I
>> > > >> > > > > think these have crept in over time with people not really
>> > > >> realizing
>> > > >> > > this
>> > > >> > > > > code is perf sensitive and java hiding a lot of what is and
>> > > isn't
>> > > >> a
>> > > >> > > file
>> > > >> > > > > operation. One part of this is the sanityCheck() call for
>> the
>> > > two
>> > > >> > > > indexes.
>> > > >> > > > > I don't think this call reads the full index, just the last
>> > > entry
>> > > >> in
>> > > >> > > the
>> > > >> > > > > index, right?. There should be no need to read the full
>> index
>> > > >> except
>> > > >> > > > during
>> > > >> > > > > recovery (and then only for the segments being recovered).
>> I
>> > > >> think it
>> > > >> > > > would
>> > > >> > > > > make a ton of sense to optimize this but I don't think that
>> > > >> > > optimization
>> > > >> > > > > needs to be configurable as this is just a helpful sanity
>> > check
>> > > to
>> > > >> > > detect
>> > > >> > > > > common non-sensical things in the index files, but it isn't
>> > part
>> > > >> of
>> > > >> > the
>> > > >> > > > > core guarantees, in general you aren't supposed to lose
>> > > committed
>> > > >> > data
>> > > >> > > > from
>> > > >> > > > > disk, and if you do we may be able to fail faster but we
>> > > >> > fundamentally
>> > > >> > > > > can't really help you. Again I think this would make the
>> most
>> > > >> sense
>> > > >> > to
>> > > >> > > do
>> > > >> > > > > in a data driven way, if you look at that code I think it
>> is
>> > > doing
>> > > >> > > crazy
>> > > >> > > > > amounts of file operations (e.g. getAbsolutePath, file
>> sizes,
>> > > >> etc). I
>> > > >> > > > think
>> > > >> > > > > it'd make most sense to profile startup with a cold cash
>> on a
>> > > >> large
>> > > >> > log
>> > > >> > > > > directory and do the same with an strace to see how many
>> > > redundant
>> > > >> > > system
>> > > >> > > > > calls we do per segment and what is costing us and then cut
>> > some
>> > > >> of
>> > > >> > > this
>> > > >> > > > > out. I suspect we could speed up our startup time quite a
>> lot
>> > if
>> > > >> we
>> > > >> > did
>> > > >> > > > > that.
>> > > >> > > > >
>> > > >> > > > > For example we have a bunch of calls like this:
>> > > >> > > > >
>> > > >> > > > >     require(len % entrySize == 0,
>> > > >> > > > >
>> > > >> > > > >             "Index file " + file.getAbsolutePath + " is
>> > corrupt,
>> > > >> > found
>> > > >> > > "
>> > > >> > > > +
>> > > >> > > > > len +
>> > > >> > > > >
>> > > >> > > > >             " bytes which is not positive or not a
>> multiple of
>> > > >> 8.")
>> > > >> > > > > I'm pretty such file.getAbsolutePath is a system call and I
>> > > assume
>> > > >> > that
>> > > >> > > > > happens whether or not you fail the in-memory check?
>> > > >> > > > >
>> > > >> > > > > -Jay
>> > > >> > > > >
>> > > >> > > > >
>> > > >> > > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <
>> > lindong28@gmail.com
>> > > >
>> > > >> > > wrote:
>> > > >> > > > >
>> > > >> > > > > > Hi all,
>> > > >> > > > > >
>> > > >> > > > > > I have created KIP-263: Allow broker to skip sanity
>> check of
>> > > >> > inactive
>> > > >> > > > > > segments on broker startup. See
>> > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > >> > > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
>> > > >> > > > > > segments+on+broker+startup
>> > > >> > > > > > .
>> > > >> > > > > >
>> > > >> > > > > > This KIP provides a way to significantly reduce time to
>> > > rolling
>> > > >> > > bounce
>> > > >> > > > a
>> > > >> > > > > > Kafka cluster.
>> > > >> > > > > >
>> > > >> > > > > > Comments are welcome!
>> > > >> > > > > >
>> > > >> > > > > > Thanks,
>> > > >> > > > > > Dong
>> > > >> > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Posted by Dong Lin <li...@gmail.com>.

Thanks for the reply Jason and Dhruvil.

Yeah we don't need config for the sanity check and thus we don't need a
KIP. I think we are on the same page of just skipping the sanity check of
segments before the recovery offset. I will close the KIP and submit a
patch for this.

On Wed, Jun 27, 2018 at 10:09 AM, Dhruvil Shah <dh...@confluent.io> wrote:

> +1 to what Jason said. We need a better long-term strategy for dealing with
> corrupted log and index data, but the sanity checks we have do not
> guarantee much in this regard.
>
> For now, we could do away with these index sanity checks in my opinion. We
> could handle the missing index case at startup. I think we could have
> missing index files only when users are upgrading from a version that did
> not have a particular type of index to a version that does, or if the
> operator physically deleted these files. Because these are rare scenarios,
> having to recreate a missing index should typically not affect normal
> startup time.
>
> - Dhruvil
>
> On Wed, Jun 27, 2018 at 8:47 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Dong,
> >
> >
> > So the main concern with the above approach is that, if for any reason
> the
> > > index files of inactive segment is deleted or corrupted, the broker
> will
> > > halt if there is only one log directory. This is different from the
> > > existing behavior where the broker will rebuild the index for this
> > inactive
> > > segment before it can accept any request from consumer. Though we don't
> > > have provide guarantee for segments already flushed to disk, this still
> > > seems like a change in behavior for user. Maybe we don't have to worry
> > > about this if we decide it is very rare, e.g. it happens only when
> there
> > is
> > > disk error or when there is human error.
> >
> >
> > I think we should probably still handle the case when an index file is
> > missing during startup? But considering how weak the sanity check is, it
> > seems fine to skip it.  Also, could we just make this change without a
> KIP?
> > Adding a config to enable a wimpy sanity check seems unnecessary.
> >
> > One scenario that does come up with users is actual segment corruption,
> > which is only detected by consumers that are validating CRCs. To fix it,
> we
> > have to manually delete the segments and force re-replication. It would
> be
> > helpful to have a config to enable deep checking on startup for
> particular
> > topics or partitions. This could also just be a separate tool though
> > ("kafka-fsck" or something).
> >
> > Thinking longer term, I think we need a more systematic approach to
> dealing
> > with corruption, not just in index files, but in the segments as well. It
> > might be nice, for example, if the consumer had a way to hint the broker
> > that a particular offset is corrupt. The leader might then demote itself,
> > for example, and try to repair. Lots to think through though.
> >
> > -Jason
> >
> >
> >
> >
> > On Wed, Jun 27, 2018 at 12:29 AM, Dong Lin <li...@gmail.com> wrote:
> >
> > > So the main concern with the above approach is that, if for any reason
> > the
> > > index files of inactive segment is deleted or corrupted, the broker
> will
> > > halt if there is only one log directory. This is different from the
> > > existing behavior where the broker will rebuild the index for this
> > inactive
> > > segment before it can accept any request from consumer. Though we don't
> > > have provide guarantee for segments already flushed to disk, this still
> > > seems like a change in behavior for user. Maybe we don't have to worry
> > > about this if we decide it is very rare, e.g. it happens only when
> there
> > is
> > > disk error or when there is human error.
> > >
> > >
> > >
> > > On Wed, Jun 27, 2018 at 12:04 AM, Dong Lin <li...@gmail.com>
> wrote:
> > >
> > > > Hey Jason,
> > > >
> > > > Thanks for the comment!
> > > >
> > > > Your comment reminded me to read through Jay's comments and my reply
> > > > again. It seems that I probably have not captured idea of Jay's
> comment
> > > > that says sanity check is not part of any formal guarantee we
> provide.
> > I
> > > > probably should have thought about this comment more. Let me reply to
> > > both
> > > > yours and Jay's comment and see if I can understand you better.
> > > >
> > > > Here are some clarifications:
> > > > - KIP does not intend to optimize recovery. It aims to optimize the
> the
> > > > sanity check when there is clean shutdown.
> > > > - Sanity check only read the last entry of the index rather than the
> > full
> > > > index
> > > > - We have already done data driven investigation though it is not
> done
> > > > using hprof or strace. The resulting rolling bounce time is
> acceptable
> > > now.
> > > > If it appears to be an issue e.g. after more data then we may need to
> > > > revisit this with more data driven investigation
> > > >
> > > > I agree with the following comments:
> > > > - We should optimize the default behavior instead of adding a new
> > config.
> > > > - sanity check of the segments before recovery offset is not part of
> > any
> > > > formal guarantee and thus we probably can just skip it.
> > > >
> > > > So we are all leaning towards skipping the sanity check of all
> segments
> > > > before the recovery offset. This solution would be pretty
> > straightforward
> > > > to understand and implement. And I am sure it will give us all the
> > > benefits
> > > > that this KIP intends to achieve. Here is only one question to double
> > > check:
> > > >
> > > > If consumer fetches from an inactive segment, broker will just use
> the
> > > > index of that inactive segment. If anything goes wrong, e.g. the
> index
> > > file
> > > > is corrupted or the index file does not exist, then the broker will
> > just
> > > > consider it as IOException, mark the disk and the partitions on the
> > disk
> > > > offline and respond KafkaStorageException to consumer. Does this
> sound
> > > OK?
> > > > One alternative solution is to let broker rebuild index. But this
> > > > alternative solution is inconsistent with the idea that "sanity check
> > is
> > > not
> > > > part of any formal guarantee" and it may tie up all request handler
> > > > thread for rebuilding the indexed.
> > > >
> > > >
> > > > If this solution sounds right, I will update the KIP accordingly.
> > > >
> > > > Thanks,
> > > > Dong
> > > >
> > > > On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <jason@confluent.io
> >
> > > > wrote:
> > > >
> > > >> Hey Dong,
> > > >>
> > > >> Sorry for being slow to catch up to this.
> > > >>
> > > >> I think the benefit of the sanity check seems a little dubious in
> the
> > > >> first
> > > >> place. We detect garbage at the end of the index file, but that's
> > about
> > > >> it.
> > > >> Is there any reason to think that corruption is more likely to occur
> > > there
> > > >> or any other reason to think this check is still beneficial for
> > flushed
> > > >> data? I assume we did the check because we presumed it was cheap,
> but
> > > >> perhaps the cost is adding up as the number of partitions grows. How
> > > much
> > > >> does startup time improve if we skip the sanity check for data
> earlier
> > > >> than
> > > >> the recovery point? Does the lazy loading itself give some
> additional
> > > >> benefit beyond skipping the sanity check? As Jay mentions above, the
> > > >> sanity
> > > >> checks seem strictly speaking optional. We don't bother checking the
> > > >> segments themselves for example.
> > > >>
> > > >> Thanks,
> > > >> Jason
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> It probably still makes sense for segments beyond the recovery point
> > > >>
> > > >> On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <li...@gmail.com>
> > wrote:
> > > >>
> > > >> > Hey Jay,
> > > >> >
> > > >> > Yeah our existing sanity check only read the last entry in the
> index
> > > >> files.
> > > >> > I must have miscommunicated if I previously said it was reading
> the
> > > full
> > > >> > index. Broker appears to be spending a lot of time just to read
> the
> > > last
> > > >> > entry of index files for every log segment. This is probably
> because
> > > OS
> > > >> > will load a chunk of data that is much larger than the entry
> itself
> > > from
> > > >> > disk to page cache. This KIP tries to make this part of operation
> > > lazy.
> > > >> I
> > > >> > guess you are suggesting that we should just make the lazy loading
> > the
> > > >> > default behavior?
> > > >> >
> > > >> > Yes we currently require manual intervention if the log file is
> > > >> corrupted,
> > > >> > i.e. if two messages with the same offset are appended to the disk
> > > >> > (KAFKA-6488). The sanity check on broker startup is a bit
> different
> > > >> since
> > > >> > it deals with the corruption of index files (e.g. offset index,
> time
> > > >> index
> > > >> > and snapshot files) instead of the log data. In this case if index
> > > files
> > > >> > are corrupted broker will automatically recover it by rebuilding
> the
> > > >> index
> > > >> > files using data in the log files, without requiring manual
> > > >> intervention.
> > > >> > Thus the design question is whether this should be done before
> > broker
> > > >> can
> > > >> > become leader for any partitions -- there is tradeoff between
> broker
> > > >> > startup time and risk of delaying user requests if broker need to
> > > >> rebuild
> > > >> > index files when it is already leader. I prefer lazy loading to
> > reduce
> > > >> > broker startup time. Not sure what are the feedback from the
> > community
> > > >> on
> > > >> > this issue.
> > > >> >
> > > >> > Thanks,
> > > >> > Dong
> > > >> >
> > > >> >
> > > >> > On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <ja...@confluent.io>
> > wrote:
> > > >> >
> > > >> > > Hey Dong,
> > > >> > >
> > > >> > > Makes total sense. What I'm saying is I don't think that the
> > sanity
> > > >> check
> > > >> > > is part of any formal guarantee we provide. It is true that
> > > >> corruption of
> > > >> > > data flushed to disk will be a potential problem, but I don't
> > think
> > > >> the
> > > >> > > sanity check solves that it just has a couple heuristics to help
> > > >> detect
> > > >> > > certain possible instances of it, right? In general I think our
> > > >> > assumption
> > > >> > > has been that flushed data doesn't disappear or get corrupted
> and
> > if
> > > >> it
> > > >> > > does you need to manually intervene. I don't think people want
> to
> > > >> > configure
> > > >> > > things at this level so what I was suggesting was understanding
> > why
> > > >> the
> > > >> > > sanity check is slow and trying to avoid that rather than making
> > it
> > > >> > > configurable. I think you mentioned it was reading the full
> index
> > > into
> > > >> > > memory. Based on the performance you describe this could be
> true,
> > > but
> > > >> it
> > > >> > > definitely should not be reading anything but the last entry in
> > the
> > > >> index
> > > >> > > so that would be a bug. That read also happens in sanityCheck()
> > only
> > > >> in
> > > >> > the
> > > >> > > time-based index right? In the offset index we do the same read
> > but
> > > it
> > > >> > > happens in initialization. If that read is the slow thing it
> might
> > > >> make
> > > >> > > sense to try to remove it or make it lazy in both cases. If it
> is
> > > some
> > > >> > > other part of the code then (e.g. the size check) then that may
> be
> > > >> able
> > > >> > to
> > > >> > > be avoided entirely (I think by the time we sanity check we
> > already
> > > >> know
> > > >> > > the file size from the mapping...). That was what I meant by
> doing
> > > >> some
> > > >> > > data driven analysis. Maybe a quick run with hprof would help
> > > >> determine
> > > >> > the
> > > >> > > root cause of why sanityCheck is slow?
> > > >> > >
> > > >> > > -Jay
> > > >> > >
> > > >> > > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <li...@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > > > Hey Jay,
> > > >> > > >
> > > >> > > > Thanks for your comments!
> > > >> > > >
> > > >> > > > Yeah recovery is different from the sanity check. They are
> > > >> correlated
> > > >> > in
> > > >> > > > the sense that there may still be corrupted index files even
> > after
> > > >> > clean
> > > >> > > > broker shutdown. And in that case if we delay the sanity check
> > > then
> > > >> we
> > > >> > > may
> > > >> > > > delay the log recovery. The main goal of this KIP is to
> optimize
> > > the
> > > >> > > sanity
> > > >> > > > check related work so that it does not delay the broker
> startup
> > > >> much.
> > > >> > > >
> > > >> > > > The KIP mentioned that the sanity check is done using log
> > recovery
> > > >> > > > background thread. The name "recovery" is mentioned mainly
> > because
> > > >> the
> > > >> > > > background thread number is determined using the existing
> > > >> > > > config num.recovery.threads.per.data.dir. I have updated the
> KIP
> > > to
> > > >> > make
> > > >> > > > this less confusing.
> > > >> > > >
> > > >> > > > It makes a ton of sense to optimize the broker startup time
> in a
> > > >> data
> > > >> > > > driven fashion. The currently optimize is done kind of in this
> > > >> fashion.
> > > >> > > The
> > > >> > > > broker log shows that LogManager.loadLogs() takes a long time
> in
> > > >> large
> > > >> > > > clusters. Then I started broker with cold cache and repeatedly
> > get
> > > >> > thread
> > > >> > > > dump to see what are broker threads are doing during
> > > >> > > LogManager.loadLogs().
> > > >> > > > Most of the threads are working on sanityCheck() and this
> > > motivates
> > > >> the
> > > >> > > > change in this KIP. Previously broker shutdown time was
> > > investigated
> > > >> > in a
> > > >> > > > similar data driven fashion and optimized with KAFKA-6172 and
> > > >> > KAFKA-6175.
> > > >> > > > It seems that the current KIP can reduces the rolling bounce
> > time
> > > >> of a
> > > >> > > > large cluster by 50% -- there may be room for further
> > improvement
> > > >> but
> > > >> > > maybe
> > > >> > > > those do not require as big a change (with the caveat
> described
> > in
> > > >> the
> > > >> > > KIP)
> > > >> > > > as suggested in this KIP.
> > > >> > > >
> > > >> > > > It is not clear whether it is safe to just read the latest
> > segment
> > > >> > > without
> > > >> > > > sanity checking all previous inactive segment of a given
> > partition
> > > >> if
> > > >> > > > transaction is used. Otherwise we probably want to always skip
> > the
> > > >> > sanity
> > > >> > > > check of inactive segments without introducing a new config.
> > Maybe
> > > >> the
> > > >> > > > developers familiar with the transaction can comment on that?
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > > Dong
> > > >> > > >
> > > >> > > >
> > > >> > > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <ja...@confluent.io>
> > > >> wrote:
> > > >> > > >
> > > >> > > > > Optimizing startup seems really valuable but I'm a little
> > > >> confused by
> > > >> > > > this.
> > > >> > > > >
> > > >> > > > > There are two different things:
> > > >> > > > > 1. Recovery
> > > >> > > > > 2. Sanity check
> > > >> > > > >
> > > >> > > > > The terminology we're using is a bit mixed here.
> > > >> > > > >
> > > >> > > > > Recovery means checksumming the log segments and rebuilding
> > the
> > > >> index
> > > >> > > on
> > > >> > > > a
> > > >> > > > > hard crash. This only happens on unflushed segments, which
> is
> > > >> > generally
> > > >> > > > > just the last segment. Recovery is essential for the
> > correctness
> > > >> > > > guarantees
> > > >> > > > > of the log and you shouldn't disable it. It only happens on
> > hard
> > > >> > crash
> > > >> > > > and
> > > >> > > > > is not a factor in graceful restart. We can likely optimize
> it
> > > but
> > > >> > that
> > > >> > > > > would make most sense to do in a data driven fashion off
> some
> > > >> > > profiling.
> > > >> > > > >
> > > >> > > > > However there is also a ton of disk activity that happens
> > during
> > > >> > > > > initialization (lots of checks on the file size, absolute
> > path,
> > > >> > etc). I
> > > >> > > > > think these have crept in over time with people not really
> > > >> realizing
> > > >> > > this
> > > >> > > > > code is perf sensitive and java hiding a lot of what is and
> > > isn't
> > > >> a
> > > >> > > file
> > > >> > > > > operation. One part of this is the sanityCheck() call for
> the
> > > two
> > > >> > > > indexes.
> > > >> > > > > I don't think this call reads the full index, just the last
> > > entry
> > > >> in
> > > >> > > the
> > > >> > > > > index, right?. There should be no need to read the full
> index
> > > >> except
> > > >> > > > during
> > > >> > > > > recovery (and then only for the segments being recovered). I
> > > >> think it
> > > >> > > > would
> > > >> > > > > make a ton of sense to optimize this but I don't think that
> > > >> > > optimization
> > > >> > > > > needs to be configurable as this is just a helpful sanity
> > check
> > > to
> > > >> > > detect
> > > >> > > > > common non-sensical things in the index files, but it isn't
> > part
> > > >> of
> > > >> > the
> > > >> > > > > core guarantees, in general you aren't supposed to lose
> > > committed
> > > >> > data
> > > >> > > > from
> > > >> > > > > disk, and if you do we may be able to fail faster but we
> > > >> > fundamentally
> > > >> > > > > can't really help you. Again I think this would make the
> most
> > > >> sense
> > > >> > to
> > > >> > > do
> > > >> > > > > in a data driven way, if you look at that code I think it is
> > > doing
> > > >> > > crazy
> > > >> > > > > amounts of file operations (e.g. getAbsolutePath, file
> sizes,
> > > >> etc). I
> > > >> > > > think
> > > >> > > > > it'd make most sense to profile startup with a cold cash on
> a
> > > >> large
> > > >> > log
> > > >> > > > > directory and do the same with an strace to see how many
> > > redundant
> > > >> > > system
> > > >> > > > > calls we do per segment and what is costing us and then cut
> > some
> > > >> of
> > > >> > > this
> > > >> > > > > out. I suspect we could speed up our startup time quite a
> lot
> > if
> > > >> we
> > > >> > did
> > > >> > > > > that.
> > > >> > > > >
> > > >> > > > > For example we have a bunch of calls like this:
> > > >> > > > >
> > > >> > > > >     require(len % entrySize == 0,
> > > >> > > > >
> > > >> > > > >             "Index file " + file.getAbsolutePath + " is
> > corrupt,
> > > >> > found
> > > >> > > "
> > > >> > > > +
> > > >> > > > > len +
> > > >> > > > >
> > > >> > > > >             " bytes which is not positive or not a multiple
> of
> > > >> 8.")
> > > >> > > > > I'm pretty such file.getAbsolutePath is a system call and I
> > > assume
> > > >> > that
> > > >> > > > > happens whether or not you fail the in-memory check?
> > > >> > > > >
> > > >> > > > > -Jay
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <
> > lindong28@gmail.com
> > > >
> > > >> > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi all,
> > > >> > > > > >
> > > >> > > > > > I have created KIP-263: Allow broker to skip sanity check
> of
> > > >> > inactive
> > > >> > > > > > segments on broker startup. See
> > > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > >> > > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
> > > >> > > > > > segments+on+broker+startup
> > > >> > > > > > .
> > > >> > > > > >
> > > >> > > > > > This KIP provides a way to significantly reduce time to
> > > rolling
> > > >> > > bounce
> > > >> > > > a
> > > >> > > > > > Kafka cluster.
> > > >> > > > > >
> > > >> > > > > > Comments are welcome!
> > > >> > > > > >
> > > >> > > > > > Thanks,
> > > >> > > > > > Dong
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Posted by Dhruvil Shah <dh...@confluent.io>.

+1 to what Jason said. We need a better long-term strategy for dealing with
corrupted log and index data, but the sanity checks we have do not
guarantee much in this regard.

For now, we could do away with these index sanity checks in my opinion. We
could handle the missing index case at startup. I think we could have
missing index files only when users are upgrading from a version that did
not have a particular type of index to a version that does, or if the
operator physically deleted these files. Because these are rare scenarios,
having to recreate a missing index should typically not affect normal
startup time.

- Dhruvil

On Wed, Jun 27, 2018 at 8:47 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Dong,
>
>
> So the main concern with the above approach is that, if for any reason the
> > index files of inactive segment is deleted or corrupted, the broker will
> > halt if there is only one log directory. This is different from the
> > existing behavior where the broker will rebuild the index for this
> inactive
> > segment before it can accept any request from consumer. Though we don't
> > have provide guarantee for segments already flushed to disk, this still
> > seems like a change in behavior for user. Maybe we don't have to worry
> > about this if we decide it is very rare, e.g. it happens only when there
> is
> > disk error or when there is human error.
>
>
> I think we should probably still handle the case when an index file is
> missing during startup? But considering how weak the sanity check is, it
> seems fine to skip it.  Also, could we just make this change without a KIP?
> Adding a config to enable a wimpy sanity check seems unnecessary.
>
> One scenario that does come up with users is actual segment corruption,
> which is only detected by consumers that are validating CRCs. To fix it, we
> have to manually delete the segments and force re-replication. It would be
> helpful to have a config to enable deep checking on startup for particular
> topics or partitions. This could also just be a separate tool though
> ("kafka-fsck" or something).
>
> Thinking longer term, I think we need a more systematic approach to dealing
> with corruption, not just in index files, but in the segments as well. It
> might be nice, for example, if the consumer had a way to hint the broker
> that a particular offset is corrupt. The leader might then demote itself,
> for example, and try to repair. Lots to think through though.
>
> -Jason
>
>
>
>
> On Wed, Jun 27, 2018 at 12:29 AM, Dong Lin <li...@gmail.com> wrote:
>
> > So the main concern with the above approach is that, if for any reason
> the
> > index files of inactive segment is deleted or corrupted, the broker will
> > halt if there is only one log directory. This is different from the
> > existing behavior where the broker will rebuild the index for this
> inactive
> > segment before it can accept any request from consumer. Though we don't
> > have provide guarantee for segments already flushed to disk, this still
> > seems like a change in behavior for user. Maybe we don't have to worry
> > about this if we decide it is very rare, e.g. it happens only when there
> is
> > disk error or when there is human error.
> >
> >
> >
> > On Wed, Jun 27, 2018 at 12:04 AM, Dong Lin <li...@gmail.com> wrote:
> >
> > > Hey Jason,
> > >
> > > Thanks for the comment!
> > >
> > > Your comment reminded me to read through Jay's comments and my reply
> > > again. It seems that I probably have not captured idea of Jay's comment
> > > that says sanity check is not part of any formal guarantee we provide.
> I
> > > probably should have thought about this comment more. Let me reply to
> > both
> > > yours and Jay's comment and see if I can understand you better.
> > >
> > > Here are some clarifications:
> > > - KIP does not intend to optimize recovery. It aims to optimize the the
> > > sanity check when there is clean shutdown.
> > > - Sanity check only read the last entry of the index rather than the
> full
> > > index
> > > - We have already done data driven investigation though it is not done
> > > using hprof or strace. The resulting rolling bounce time is acceptable
> > now.
> > > If it appears to be an issue e.g. after more data then we may need to
> > > revisit this with more data driven investigation
> > >
> > > I agree with the following comments:
> > > - We should optimize the default behavior instead of adding a new
> config.
> > > - sanity check of the segments before recovery offset is not part of
> any
> > > formal guarantee and thus we probably can just skip it.
> > >
> > > So we are all leaning towards skipping the sanity check of all segments
> > > before the recovery offset. This solution would be pretty
> straightforward
> > > to understand and implement. And I am sure it will give us all the
> > benefits
> > > that this KIP intends to achieve. Here is only one question to double
> > check:
> > >
> > > If consumer fetches from an inactive segment, broker will just use the
> > > index of that inactive segment. If anything goes wrong, e.g. the index
> > file
> > > is corrupted or the index file does not exist, then the broker will
> just
> > > consider it as IOException, mark the disk and the partitions on the
> disk
> > > offline and respond KafkaStorageException to consumer. Does this sound
> > OK?
> > > One alternative solution is to let broker rebuild index. But this
> > > alternative solution is inconsistent with the idea that "sanity check
> is
> > not
> > > part of any formal guarantee" and it may tie up all request handler
> > > thread for rebuilding the indexed.
> > >
> > >
> > > If this solution sounds right, I will update the KIP accordingly.
> > >
> > > Thanks,
> > > Dong
> > >
> > > On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > >> Hey Dong,
> > >>
> > >> Sorry for being slow to catch up to this.
> > >>
> > >> I think the benefit of the sanity check seems a little dubious in the
> > >> first
> > >> place. We detect garbage at the end of the index file, but that's
> about
> > >> it.
> > >> Is there any reason to think that corruption is more likely to occur
> > there
> > >> or any other reason to think this check is still beneficial for
> flushed
> > >> data? I assume we did the check because we presumed it was cheap, but
> > >> perhaps the cost is adding up as the number of partitions grows. How
> > much
> > >> does startup time improve if we skip the sanity check for data earlier
> > >> than
> > >> the recovery point? Does the lazy loading itself give some additional
> > >> benefit beyond skipping the sanity check? As Jay mentions above, the
> > >> sanity
> > >> checks seem strictly speaking optional. We don't bother checking the
> > >> segments themselves for example.
> > >>
> > >> Thanks,
> > >> Jason
> > >>
> > >>
> > >>
> > >>
> > >> It probably still makes sense for segments beyond the recovery point
> > >>
> > >> On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <li...@gmail.com>
> wrote:
> > >>
> > >> > Hey Jay,
> > >> >
> > >> > Yeah our existing sanity check only read the last entry in the index
> > >> files.
> > >> > I must have miscommunicated if I previously said it was reading the
> > full
> > >> > index. Broker appears to be spending a lot of time just to read the
> > last
> > >> > entry of index files for every log segment. This is probably because
> > OS
> > >> > will load a chunk of data that is much larger than the entry itself
> > from
> > >> > disk to page cache. This KIP tries to make this part of operation
> > lazy.
> > >> I
> > >> > guess you are suggesting that we should just make the lazy loading
> the
> > >> > default behavior?
> > >> >
> > >> > Yes we currently require manual intervention if the log file is
> > >> corrupted,
> > >> > i.e. if two messages with the same offset are appended to the disk
> > >> > (KAFKA-6488). The sanity check on broker startup is a bit different
> > >> since
> > >> > it deals with the corruption of index files (e.g. offset index, time
> > >> index
> > >> > and snapshot files) instead of the log data. In this case if index
> > files
> > >> > are corrupted broker will automatically recover it by rebuilding the
> > >> index
> > >> > files using data in the log files, without requiring manual
> > >> intervention.
> > >> > Thus the design question is whether this should be done before
> broker
> > >> can
> > >> > become leader for any partitions -- there is tradeoff between broker
> > >> > startup time and risk of delaying user requests if broker need to
> > >> rebuild
> > >> > index files when it is already leader. I prefer lazy loading to
> reduce
> > >> > broker startup time. Not sure what are the feedback from the
> community
> > >> on
> > >> > this issue.
> > >> >
> > >> > Thanks,
> > >> > Dong
> > >> >
> > >> >
> > >> > On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <ja...@confluent.io>
> wrote:
> > >> >
> > >> > > Hey Dong,
> > >> > >
> > >> > > Makes total sense. What I'm saying is I don't think that the
> sanity
> > >> check
> > >> > > is part of any formal guarantee we provide. It is true that
> > >> corruption of
> > >> > > data flushed to disk will be a potential problem, but I don't
> think
> > >> the
> > >> > > sanity check solves that it just has a couple heuristics to help
> > >> detect
> > >> > > certain possible instances of it, right? In general I think our
> > >> > assumption
> > >> > > has been that flushed data doesn't disappear or get corrupted and
> if
> > >> it
> > >> > > does you need to manually intervene. I don't think people want to
> > >> > configure
> > >> > > things at this level so what I was suggesting was understanding
> why
> > >> the
> > >> > > sanity check is slow and trying to avoid that rather than making
> it
> > >> > > configurable. I think you mentioned it was reading the full index
> > into
> > >> > > memory. Based on the performance you describe this could be true,
> > but
> > >> it
> > >> > > definitely should not be reading anything but the last entry in
> the
> > >> index
> > >> > > so that would be a bug. That read also happens in sanityCheck()
> only
> > >> in
> > >> > the
> > >> > > time-based index right? In the offset index we do the same read
> but
> > it
> > >> > > happens in initialization. If that read is the slow thing it might
> > >> make
> > >> > > sense to try to remove it or make it lazy in both cases. If it is
> > some
> > >> > > other part of the code then (e.g. the size check) then that may be
> > >> able
> > >> > to
> > >> > > be avoided entirely (I think by the time we sanity check we
> already
> > >> know
> > >> > > the file size from the mapping...). That was what I meant by doing
> > >> some
> > >> > > data driven analysis. Maybe a quick run with hprof would help
> > >> determine
> > >> > the
> > >> > > root cause of why sanityCheck is slow?
> > >> > >
> > >> > > -Jay
> > >> > >
> > >> > > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <li...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hey Jay,
> > >> > > >
> > >> > > > Thanks for your comments!
> > >> > > >
> > >> > > > Yeah recovery is different from the sanity check. They are
> > >> correlated
> > >> > in
> > >> > > > the sense that there may still be corrupted index files even
> after
> > >> > clean
> > >> > > > broker shutdown. And in that case if we delay the sanity check
> > then
> > >> we
> > >> > > may
> > >> > > > delay the log recovery. The main goal of this KIP is to optimize
> > the
> > >> > > sanity
> > >> > > > check related work so that it does not delay the broker startup
> > >> much.
> > >> > > >
> > >> > > > The KIP mentioned that the sanity check is done using log
> recovery
> > >> > > > background thread. The name "recovery" is mentioned mainly
> because
> > >> the
> > >> > > > background thread number is determined using the existing
> > >> > > > config num.recovery.threads.per.data.dir. I have updated the KIP
> > to
> > >> > make
> > >> > > > this less confusing.
> > >> > > >
> > >> > > > It makes a ton of sense to optimize the broker startup time in a
> > >> data
> > >> > > > driven fashion. The currently optimize is done kind of in this
> > >> fashion.
> > >> > > The
> > >> > > > broker log shows that LogManager.loadLogs() takes a long time in
> > >> large
> > >> > > > clusters. Then I started broker with cold cache and repeatedly
> get
> > >> > thread
> > >> > > > dump to see what are broker threads are doing during
> > >> > > LogManager.loadLogs().
> > >> > > > Most of the threads are working on sanityCheck() and this
> > motivates
> > >> the
> > >> > > > change in this KIP. Previously broker shutdown time was
> > investigated
> > >> > in a
> > >> > > > similar data driven fashion and optimized with KAFKA-6172 and
> > >> > KAFKA-6175.
> > >> > > > It seems that the current KIP can reduces the rolling bounce
> time
> > >> of a
> > >> > > > large cluster by 50% -- there may be room for further
> improvement
> > >> but
> > >> > > maybe
> > >> > > > those do not require as big a change (with the caveat described
> in
> > >> the
> > >> > > KIP)
> > >> > > > as suggested in this KIP.
> > >> > > >
> > >> > > > It is not clear whether it is safe to just read the latest
> segment
> > >> > > without
> > >> > > > sanity checking all previous inactive segment of a given
> partition
> > >> if
> > >> > > > transaction is used. Otherwise we probably want to always skip
> the
> > >> > sanity
> > >> > > > check of inactive segments without introducing a new config.
> Maybe
> > >> the
> > >> > > > developers familiar with the transaction can comment on that?
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Dong
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <ja...@confluent.io>
> > >> wrote:
> > >> > > >
> > >> > > > > Optimizing startup seems really valuable but I'm a little
> > >> confused by
> > >> > > > this.
> > >> > > > >
> > >> > > > > There are two different things:
> > >> > > > > 1. Recovery
> > >> > > > > 2. Sanity check
> > >> > > > >
> > >> > > > > The terminology we're using is a bit mixed here.
> > >> > > > >
> > >> > > > > Recovery means checksumming the log segments and rebuilding
> the
> > >> index
> > >> > > on
> > >> > > > a
> > >> > > > > hard crash. This only happens on unflushed segments, which is
> > >> > generally
> > >> > > > > just the last segment. Recovery is essential for the
> correctness
> > >> > > > guarantees
> > >> > > > > of the log and you shouldn't disable it. It only happens on
> hard
> > >> > crash
> > >> > > > and
> > >> > > > > is not a factor in graceful restart. We can likely optimize it
> > but
> > >> > that
> > >> > > > > would make most sense to do in a data driven fashion off some
> > >> > > profiling.
> > >> > > > >
> > >> > > > > However there is also a ton of disk activity that happens
> during
> > >> > > > > initialization (lots of checks on the file size, absolute
> path,
> > >> > etc). I
> > >> > > > > think these have crept in over time with people not really
> > >> realizing
> > >> > > this
> > >> > > > > code is perf sensitive and java hiding a lot of what is and
> > isn't
> > >> a
> > >> > > file
> > >> > > > > operation. One part of this is the sanityCheck() call for the
> > two
> > >> > > > indexes.
> > >> > > > > I don't think this call reads the full index, just the last
> > entry
> > >> in
> > >> > > the
> > >> > > > > index, right?. There should be no need to read the full index
> > >> except
> > >> > > > during
> > >> > > > > recovery (and then only for the segments being recovered). I
> > >> think it
> > >> > > > would
> > >> > > > > make a ton of sense to optimize this but I don't think that
> > >> > > optimization
> > >> > > > > needs to be configurable as this is just a helpful sanity
> check
> > to
> > >> > > detect
> > >> > > > > common non-sensical things in the index files, but it isn't
> part
> > >> of
> > >> > the
> > >> > > > > core guarantees, in general you aren't supposed to lose
> > committed
> > >> > data
> > >> > > > from
> > >> > > > > disk, and if you do we may be able to fail faster but we
> > >> > fundamentally
> > >> > > > > can't really help you. Again I think this would make the most
> > >> sense
> > >> > to
> > >> > > do
> > >> > > > > in a data driven way, if you look at that code I think it is
> > doing
> > >> > > crazy
> > >> > > > > amounts of file operations (e.g. getAbsolutePath, file sizes,
> > >> etc). I
> > >> > > > think
> > >> > > > > it'd make most sense to profile startup with a cold cash on a
> > >> large
> > >> > log
> > >> > > > > directory and do the same with an strace to see how many
> > redundant
> > >> > > system
> > >> > > > > calls we do per segment and what is costing us and then cut
> some
> > >> of
> > >> > > this
> > >> > > > > out. I suspect we could speed up our startup time quite a lot
> if
> > >> we
> > >> > did
> > >> > > > > that.
> > >> > > > >
> > >> > > > > For example we have a bunch of calls like this:
> > >> > > > >
> > >> > > > >     require(len % entrySize == 0,
> > >> > > > >
> > >> > > > >             "Index file " + file.getAbsolutePath + " is
> corrupt,
> > >> > found
> > >> > > "
> > >> > > > +
> > >> > > > > len +
> > >> > > > >
> > >> > > > >             " bytes which is not positive or not a multiple of
> > >> 8.")
> > >> > > > > I'm pretty such file.getAbsolutePath is a system call and I
> > assume
> > >> > that
> > >> > > > > happens whether or not you fail the in-memory check?
> > >> > > > >
> > >> > > > > -Jay
> > >> > > > >
> > >> > > > >
> > >> > > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <
> lindong28@gmail.com
> > >
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > Hi all,
> > >> > > > > >
> > >> > > > > > I have created KIP-263: Allow broker to skip sanity check of
> > >> > inactive
> > >> > > > > > segments on broker startup. See
> > >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >> > > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
> > >> > > > > > segments+on+broker+startup
> > >> > > > > > .
> > >> > > > > >
> > >> > > > > > This KIP provides a way to significantly reduce time to
> > rolling
> > >> > > bounce
> > >> > > > a
> > >> > > > > > Kafka cluster.
> > >> > > > > >
> > >> > > > > > Comments are welcome!
> > >> > > > > >
> > >> > > > > > Thanks,
> > >> > > > > > Dong
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Posted by Jason Gustafson <ja...@confluent.io>.

Hey Dong,


So the main concern with the above approach is that, if for any reason the
> index files of inactive segment is deleted or corrupted, the broker will
> halt if there is only one log directory. This is different from the
> existing behavior where the broker will rebuild the index for this inactive
> segment before it can accept any request from consumer. Though we don't
> have provide guarantee for segments already flushed to disk, this still
> seems like a change in behavior for user. Maybe we don't have to worry
> about this if we decide it is very rare, e.g. it happens only when there is
> disk error or when there is human error.


I think we should probably still handle the case when an index file is
missing during startup? But considering how weak the sanity check is, it
seems fine to skip it.  Also, could we just make this change without a KIP?
Adding a config to enable a wimpy sanity check seems unnecessary.

One scenario that does come up with users is actual segment corruption,
which is only detected by consumers that are validating CRCs. To fix it, we
have to manually delete the segments and force re-replication. It would be
helpful to have a config to enable deep checking on startup for particular
topics or partitions. This could also just be a separate tool though
("kafka-fsck" or something).

Thinking longer term, I think we need a more systematic approach to dealing
with corruption, not just in index files, but in the segments as well. It
might be nice, for example, if the consumer had a way to hint the broker
that a particular offset is corrupt. The leader might then demote itself,
for example, and try to repair. Lots to think through though.

-Jason




On Wed, Jun 27, 2018 at 12:29 AM, Dong Lin <li...@gmail.com> wrote:

> So the main concern with the above approach is that, if for any reason the
> index files of inactive segment is deleted or corrupted, the broker will
> halt if there is only one log directory. This is different from the
> existing behavior where the broker will rebuild the index for this inactive
> segment before it can accept any request from consumer. Though we don't
> have provide guarantee for segments already flushed to disk, this still
> seems like a change in behavior for user. Maybe we don't have to worry
> about this if we decide it is very rare, e.g. it happens only when there is
> disk error or when there is human error.
>
>
>
> On Wed, Jun 27, 2018 at 12:04 AM, Dong Lin <li...@gmail.com> wrote:
>
> > Hey Jason,
> >
> > Thanks for the comment!
> >
> > Your comment reminded me to read through Jay's comments and my reply
> > again. It seems that I probably have not captured idea of Jay's comment
> > that says sanity check is not part of any formal guarantee we provide. I
> > probably should have thought about this comment more. Let me reply to
> both
> > yours and Jay's comment and see if I can understand you better.
> >
> > Here are some clarifications:
> > - KIP does not intend to optimize recovery. It aims to optimize the the
> > sanity check when there is clean shutdown.
> > - Sanity check only read the last entry of the index rather than the full
> > index
> > - We have already done data driven investigation though it is not done
> > using hprof or strace. The resulting rolling bounce time is acceptable
> now.
> > If it appears to be an issue e.g. after more data then we may need to
> > revisit this with more data driven investigation
> >
> > I agree with the following comments:
> > - We should optimize the default behavior instead of adding a new config.
> > - sanity check of the segments before recovery offset is not part of any
> > formal guarantee and thus we probably can just skip it.
> >
> > So we are all leaning towards skipping the sanity check of all segments
> > before the recovery offset. This solution would be pretty straightforward
> > to understand and implement. And I am sure it will give us all the
> benefits
> > that this KIP intends to achieve. Here is only one question to double
> check:
> >
> > If consumer fetches from an inactive segment, broker will just use the
> > index of that inactive segment. If anything goes wrong, e.g. the index
> file
> > is corrupted or the index file does not exist, then the broker will just
> > consider it as IOException, mark the disk and the partitions on the disk
> > offline and respond KafkaStorageException to consumer. Does this sound
> OK?
> > One alternative solution is to let broker rebuild index. But this
> > alternative solution is inconsistent with the idea that "sanity check is
> not
> > part of any formal guarantee" and it may tie up all request handler
> > thread for rebuilding the indexed.
> >
> >
> > If this solution sounds right, I will update the KIP accordingly.
> >
> > Thanks,
> > Dong
> >
> > On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> >> Hey Dong,
> >>
> >> Sorry for being slow to catch up to this.
> >>
> >> I think the benefit of the sanity check seems a little dubious in the
> >> first
> >> place. We detect garbage at the end of the index file, but that's about
> >> it.
> >> Is there any reason to think that corruption is more likely to occur
> there
> >> or any other reason to think this check is still beneficial for flushed
> >> data? I assume we did the check because we presumed it was cheap, but
> >> perhaps the cost is adding up as the number of partitions grows. How
> much
> >> does startup time improve if we skip the sanity check for data earlier
> >> than
> >> the recovery point? Does the lazy loading itself give some additional
> >> benefit beyond skipping the sanity check? As Jay mentions above, the
> >> sanity
> >> checks seem strictly speaking optional. We don't bother checking the
> >> segments themselves for example.
> >>
> >> Thanks,
> >> Jason
> >>
> >>
> >>
> >>
> >> It probably still makes sense for segments beyond the recovery point
> >>
> >> On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <li...@gmail.com> wrote:
> >>
> >> > Hey Jay,
> >> >
> >> > Yeah our existing sanity check only read the last entry in the index
> >> files.
> >> > I must have miscommunicated if I previously said it was reading the
> full
> >> > index. Broker appears to be spending a lot of time just to read the
> last
> >> > entry of index files for every log segment. This is probably because
> OS
> >> > will load a chunk of data that is much larger than the entry itself
> from
> >> > disk to page cache. This KIP tries to make this part of operation
> lazy.
> >> I
> >> > guess you are suggesting that we should just make the lazy loading the
> >> > default behavior?
> >> >
> >> > Yes we currently require manual intervention if the log file is
> >> corrupted,
> >> > i.e. if two messages with the same offset are appended to the disk
> >> > (KAFKA-6488). The sanity check on broker startup is a bit different
> >> since
> >> > it deals with the corruption of index files (e.g. offset index, time
> >> index
> >> > and snapshot files) instead of the log data. In this case if index
> files
> >> > are corrupted broker will automatically recover it by rebuilding the
> >> index
> >> > files using data in the log files, without requiring manual
> >> intervention.
> >> > Thus the design question is whether this should be done before broker
> >> can
> >> > become leader for any partitions -- there is tradeoff between broker
> >> > startup time and risk of delaying user requests if broker need to
> >> rebuild
> >> > index files when it is already leader. I prefer lazy loading to reduce
> >> > broker startup time. Not sure what are the feedback from the community
> >> on
> >> > this issue.
> >> >
> >> > Thanks,
> >> > Dong
> >> >
> >> >
> >> > On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <ja...@confluent.io> wrote:
> >> >
> >> > > Hey Dong,
> >> > >
> >> > > Makes total sense. What I'm saying is I don't think that the sanity
> >> check
> >> > > is part of any formal guarantee we provide. It is true that
> >> corruption of
> >> > > data flushed to disk will be a potential problem, but I don't think
> >> the
> >> > > sanity check solves that it just has a couple heuristics to help
> >> detect
> >> > > certain possible instances of it, right? In general I think our
> >> > assumption
> >> > > has been that flushed data doesn't disappear or get corrupted and if
> >> it
> >> > > does you need to manually intervene. I don't think people want to
> >> > configure
> >> > > things at this level so what I was suggesting was understanding why
> >> the
> >> > > sanity check is slow and trying to avoid that rather than making it
> >> > > configurable. I think you mentioned it was reading the full index
> into
> >> > > memory. Based on the performance you describe this could be true,
> but
> >> it
> >> > > definitely should not be reading anything but the last entry in the
> >> index
> >> > > so that would be a bug. That read also happens in sanityCheck() only
> >> in
> >> > the
> >> > > time-based index right? In the offset index we do the same read but
> it
> >> > > happens in initialization. If that read is the slow thing it might
> >> make
> >> > > sense to try to remove it or make it lazy in both cases. If it is
> some
> >> > > other part of the code then (e.g. the size check) then that may be
> >> able
> >> > to
> >> > > be avoided entirely (I think by the time we sanity check we already
> >> know
> >> > > the file size from the mapping...). That was what I meant by doing
> >> some
> >> > > data driven analysis. Maybe a quick run with hprof would help
> >> determine
> >> > the
> >> > > root cause of why sanityCheck is slow?
> >> > >
> >> > > -Jay
> >> > >
> >> > > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <li...@gmail.com>
> >> wrote:
> >> > >
> >> > > > Hey Jay,
> >> > > >
> >> > > > Thanks for your comments!
> >> > > >
> >> > > > Yeah recovery is different from the sanity check. They are
> >> correlated
> >> > in
> >> > > > the sense that there may still be corrupted index files even after
> >> > clean
> >> > > > broker shutdown. And in that case if we delay the sanity check
> then
> >> we
> >> > > may
> >> > > > delay the log recovery. The main goal of this KIP is to optimize
> the
> >> > > sanity
> >> > > > check related work so that it does not delay the broker startup
> >> much.
> >> > > >
> >> > > > The KIP mentioned that the sanity check is done using log recovery
> >> > > > background thread. The name "recovery" is mentioned mainly because
> >> the
> >> > > > background thread number is determined using the existing
> >> > > > config num.recovery.threads.per.data.dir. I have updated the KIP
> to
> >> > make
> >> > > > this less confusing.
> >> > > >
> >> > > > It makes a ton of sense to optimize the broker startup time in a
> >> data
> >> > > > driven fashion. The currently optimize is done kind of in this
> >> fashion.
> >> > > The
> >> > > > broker log shows that LogManager.loadLogs() takes a long time in
> >> large
> >> > > > clusters. Then I started broker with cold cache and repeatedly get
> >> > thread
> >> > > > dump to see what are broker threads are doing during
> >> > > LogManager.loadLogs().
> >> > > > Most of the threads are working on sanityCheck() and this
> motivates
> >> the
> >> > > > change in this KIP. Previously broker shutdown time was
> investigated
> >> > in a
> >> > > > similar data driven fashion and optimized with KAFKA-6172 and
> >> > KAFKA-6175.
> >> > > > It seems that the current KIP can reduces the rolling bounce time
> >> of a
> >> > > > large cluster by 50% -- there may be room for further improvement
> >> but
> >> > > maybe
> >> > > > those do not require as big a change (with the caveat described in
> >> the
> >> > > KIP)
> >> > > > as suggested in this KIP.
> >> > > >
> >> > > > It is not clear whether it is safe to just read the latest segment
> >> > > without
> >> > > > sanity checking all previous inactive segment of a given partition
> >> if
> >> > > > transaction is used. Otherwise we probably want to always skip the
> >> > sanity
> >> > > > check of inactive segments without introducing a new config. Maybe
> >> the
> >> > > > developers familiar with the transaction can comment on that?
> >> > > >
> >> > > > Thanks,
> >> > > > Dong
> >> > > >
> >> > > >
> >> > > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <ja...@confluent.io>
> >> wrote:
> >> > > >
> >> > > > > Optimizing startup seems really valuable but I'm a little
> >> confused by
> >> > > > this.
> >> > > > >
> >> > > > > There are two different things:
> >> > > > > 1. Recovery
> >> > > > > 2. Sanity check
> >> > > > >
> >> > > > > The terminology we're using is a bit mixed here.
> >> > > > >
> >> > > > > Recovery means checksumming the log segments and rebuilding the
> >> index
> >> > > on
> >> > > > a
> >> > > > > hard crash. This only happens on unflushed segments, which is
> >> > generally
> >> > > > > just the last segment. Recovery is essential for the correctness
> >> > > > guarantees
> >> > > > > of the log and you shouldn't disable it. It only happens on hard
> >> > crash
> >> > > > and
> >> > > > > is not a factor in graceful restart. We can likely optimize it
> but
> >> > that
> >> > > > > would make most sense to do in a data driven fashion off some
> >> > > profiling.
> >> > > > >
> >> > > > > However there is also a ton of disk activity that happens during
> >> > > > > initialization (lots of checks on the file size, absolute path,
> >> > etc). I
> >> > > > > think these have crept in over time with people not really
> >> realizing
> >> > > this
> >> > > > > code is perf sensitive and java hiding a lot of what is and
> isn't
> >> a
> >> > > file
> >> > > > > operation. One part of this is the sanityCheck() call for the
> two
> >> > > > indexes.
> >> > > > > I don't think this call reads the full index, just the last
> entry
> >> in
> >> > > the
> >> > > > > index, right?. There should be no need to read the full index
> >> except
> >> > > > during
> >> > > > > recovery (and then only for the segments being recovered). I
> >> think it
> >> > > > would
> >> > > > > make a ton of sense to optimize this but I don't think that
> >> > > optimization
> >> > > > > needs to be configurable as this is just a helpful sanity check
> to
> >> > > detect
> >> > > > > common non-sensical things in the index files, but it isn't part
> >> of
> >> > the
> >> > > > > core guarantees, in general you aren't supposed to lose
> committed
> >> > data
> >> > > > from
> >> > > > > disk, and if you do we may be able to fail faster but we
> >> > fundamentally
> >> > > > > can't really help you. Again I think this would make the most
> >> sense
> >> > to
> >> > > do
> >> > > > > in a data driven way, if you look at that code I think it is
> doing
> >> > > crazy
> >> > > > > amounts of file operations (e.g. getAbsolutePath, file sizes,
> >> etc). I
> >> > > > think
> >> > > > > it'd make most sense to profile startup with a cold cash on a
> >> large
> >> > log
> >> > > > > directory and do the same with an strace to see how many
> redundant
> >> > > system
> >> > > > > calls we do per segment and what is costing us and then cut some
> >> of
> >> > > this
> >> > > > > out. I suspect we could speed up our startup time quite a lot if
> >> we
> >> > did
> >> > > > > that.
> >> > > > >
> >> > > > > For example we have a bunch of calls like this:
> >> > > > >
> >> > > > >     require(len % entrySize == 0,
> >> > > > >
> >> > > > >             "Index file " + file.getAbsolutePath + " is corrupt,
> >> > found
> >> > > "
> >> > > > +
> >> > > > > len +
> >> > > > >
> >> > > > >             " bytes which is not positive or not a multiple of
> >> 8.")
> >> > > > > I'm pretty such file.getAbsolutePath is a system call and I
> assume
> >> > that
> >> > > > > happens whether or not you fail the in-memory check?
> >> > > > >
> >> > > > > -Jay
> >> > > > >
> >> > > > >
> >> > > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <lindong28@gmail.com
> >
> >> > > wrote:
> >> > > > >
> >> > > > > > Hi all,
> >> > > > > >
> >> > > > > > I have created KIP-263: Allow broker to skip sanity check of
> >> > inactive
> >> > > > > > segments on broker startup. See
> >> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> > > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
> >> > > > > > segments+on+broker+startup
> >> > > > > > .
> >> > > > > >
> >> > > > > > This KIP provides a way to significantly reduce time to
> rolling
> >> > > bounce
> >> > > > a
> >> > > > > > Kafka cluster.
> >> > > > > >
> >> > > > > > Comments are welcome!
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > > Dong
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Posted by Dong Lin <li...@gmail.com>.

So the main concern with the above approach is that, if for any reason the
index files of inactive segment is deleted or corrupted, the broker will
halt if there is only one log directory. This is different from the
existing behavior where the broker will rebuild the index for this inactive
segment before it can accept any request from consumer. Though we don't
have provide guarantee for segments already flushed to disk, this still
seems like a change in behavior for user. Maybe we don't have to worry
about this if we decide it is very rare, e.g. it happens only when there is
disk error or when there is human error.



On Wed, Jun 27, 2018 at 12:04 AM, Dong Lin <li...@gmail.com> wrote:

> Hey Jason,
>
> Thanks for the comment!
>
> Your comment reminded me to read through Jay's comments and my reply
> again. It seems that I probably have not captured idea of Jay's comment
> that says sanity check is not part of any formal guarantee we provide. I
> probably should have thought about this comment more. Let me reply to both
> yours and Jay's comment and see if I can understand you better.
>
> Here are some clarifications:
> - KIP does not intend to optimize recovery. It aims to optimize the the
> sanity check when there is clean shutdown.
> - Sanity check only read the last entry of the index rather than the full
> index
> - We have already done data driven investigation though it is not done
> using hprof or strace. The resulting rolling bounce time is acceptable now.
> If it appears to be an issue e.g. after more data then we may need to
> revisit this with more data driven investigation
>
> I agree with the following comments:
> - We should optimize the default behavior instead of adding a new config.
> - sanity check of the segments before recovery offset is not part of any
> formal guarantee and thus we probably can just skip it.
>
> So we are all leaning towards skipping the sanity check of all segments
> before the recovery offset. This solution would be pretty straightforward
> to understand and implement. And I am sure it will give us all the benefits
> that this KIP intends to achieve. Here is only one question to double check:
>
> If consumer fetches from an inactive segment, broker will just use the
> index of that inactive segment. If anything goes wrong, e.g. the index file
> is corrupted or the index file does not exist, then the broker will just
> consider it as IOException, mark the disk and the partitions on the disk
> offline and respond KafkaStorageException to consumer. Does this sound OK?
> One alternative solution is to let broker rebuild index. But this
> alternative solution is inconsistent with the idea that "sanity check is not
> part of any formal guarantee" and it may tie up all request handler
> thread for rebuilding the indexed.
>
>
> If this solution sounds right, I will update the KIP accordingly.
>
> Thanks,
> Dong
>
> On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
>> Hey Dong,
>>
>> Sorry for being slow to catch up to this.
>>
>> I think the benefit of the sanity check seems a little dubious in the
>> first
>> place. We detect garbage at the end of the index file, but that's about
>> it.
>> Is there any reason to think that corruption is more likely to occur there
>> or any other reason to think this check is still beneficial for flushed
>> data? I assume we did the check because we presumed it was cheap, but
>> perhaps the cost is adding up as the number of partitions grows. How much
>> does startup time improve if we skip the sanity check for data earlier
>> than
>> the recovery point? Does the lazy loading itself give some additional
>> benefit beyond skipping the sanity check? As Jay mentions above, the
>> sanity
>> checks seem strictly speaking optional. We don't bother checking the
>> segments themselves for example.
>>
>> Thanks,
>> Jason
>>
>>
>>
>>
>> It probably still makes sense for segments beyond the recovery point
>>
>> On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <li...@gmail.com> wrote:
>>
>> > Hey Jay,
>> >
>> > Yeah our existing sanity check only read the last entry in the index
>> files.
>> > I must have miscommunicated if I previously said it was reading the full
>> > index. Broker appears to be spending a lot of time just to read the last
>> > entry of index files for every log segment. This is probably because OS
>> > will load a chunk of data that is much larger than the entry itself from
>> > disk to page cache. This KIP tries to make this part of operation lazy.
>> I
>> > guess you are suggesting that we should just make the lazy loading the
>> > default behavior?
>> >
>> > Yes we currently require manual intervention if the log file is
>> corrupted,
>> > i.e. if two messages with the same offset are appended to the disk
>> > (KAFKA-6488). The sanity check on broker startup is a bit different
>> since
>> > it deals with the corruption of index files (e.g. offset index, time
>> index
>> > and snapshot files) instead of the log data. In this case if index files
>> > are corrupted broker will automatically recover it by rebuilding the
>> index
>> > files using data in the log files, without requiring manual
>> intervention.
>> > Thus the design question is whether this should be done before broker
>> can
>> > become leader for any partitions -- there is tradeoff between broker
>> > startup time and risk of delaying user requests if broker need to
>> rebuild
>> > index files when it is already leader. I prefer lazy loading to reduce
>> > broker startup time. Not sure what are the feedback from the community
>> on
>> > this issue.
>> >
>> > Thanks,
>> > Dong
>> >
>> >
>> > On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <ja...@confluent.io> wrote:
>> >
>> > > Hey Dong,
>> > >
>> > > Makes total sense. What I'm saying is I don't think that the sanity
>> check
>> > > is part of any formal guarantee we provide. It is true that
>> corruption of
>> > > data flushed to disk will be a potential problem, but I don't think
>> the
>> > > sanity check solves that it just has a couple heuristics to help
>> detect
>> > > certain possible instances of it, right? In general I think our
>> > assumption
>> > > has been that flushed data doesn't disappear or get corrupted and if
>> it
>> > > does you need to manually intervene. I don't think people want to
>> > configure
>> > > things at this level so what I was suggesting was understanding why
>> the
>> > > sanity check is slow and trying to avoid that rather than making it
>> > > configurable. I think you mentioned it was reading the full index into
>> > > memory. Based on the performance you describe this could be true, but
>> it
>> > > definitely should not be reading anything but the last entry in the
>> index
>> > > so that would be a bug. That read also happens in sanityCheck() only
>> in
>> > the
>> > > time-based index right? In the offset index we do the same read but it
>> > > happens in initialization. If that read is the slow thing it might
>> make
>> > > sense to try to remove it or make it lazy in both cases. If it is some
>> > > other part of the code then (e.g. the size check) then that may be
>> able
>> > to
>> > > be avoided entirely (I think by the time we sanity check we already
>> know
>> > > the file size from the mapping...). That was what I meant by doing
>> some
>> > > data driven analysis. Maybe a quick run with hprof would help
>> determine
>> > the
>> > > root cause of why sanityCheck is slow?
>> > >
>> > > -Jay
>> > >
>> > > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <li...@gmail.com>
>> wrote:
>> > >
>> > > > Hey Jay,
>> > > >
>> > > > Thanks for your comments!
>> > > >
>> > > > Yeah recovery is different from the sanity check. They are
>> correlated
>> > in
>> > > > the sense that there may still be corrupted index files even after
>> > clean
>> > > > broker shutdown. And in that case if we delay the sanity check then
>> we
>> > > may
>> > > > delay the log recovery. The main goal of this KIP is to optimize the
>> > > sanity
>> > > > check related work so that it does not delay the broker startup
>> much.
>> > > >
>> > > > The KIP mentioned that the sanity check is done using log recovery
>> > > > background thread. The name "recovery" is mentioned mainly because
>> the
>> > > > background thread number is determined using the existing
>> > > > config num.recovery.threads.per.data.dir. I have updated the KIP to
>> > make
>> > > > this less confusing.
>> > > >
>> > > > It makes a ton of sense to optimize the broker startup time in a
>> data
>> > > > driven fashion. The currently optimize is done kind of in this
>> fashion.
>> > > The
>> > > > broker log shows that LogManager.loadLogs() takes a long time in
>> large
>> > > > clusters. Then I started broker with cold cache and repeatedly get
>> > thread
>> > > > dump to see what are broker threads are doing during
>> > > LogManager.loadLogs().
>> > > > Most of the threads are working on sanityCheck() and this motivates
>> the
>> > > > change in this KIP. Previously broker shutdown time was investigated
>> > in a
>> > > > similar data driven fashion and optimized with KAFKA-6172 and
>> > KAFKA-6175.
>> > > > It seems that the current KIP can reduces the rolling bounce time
>> of a
>> > > > large cluster by 50% -- there may be room for further improvement
>> but
>> > > maybe
>> > > > those do not require as big a change (with the caveat described in
>> the
>> > > KIP)
>> > > > as suggested in this KIP.
>> > > >
>> > > > It is not clear whether it is safe to just read the latest segment
>> > > without
>> > > > sanity checking all previous inactive segment of a given partition
>> if
>> > > > transaction is used. Otherwise we probably want to always skip the
>> > sanity
>> > > > check of inactive segments without introducing a new config. Maybe
>> the
>> > > > developers familiar with the transaction can comment on that?
>> > > >
>> > > > Thanks,
>> > > > Dong
>> > > >
>> > > >
>> > > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <ja...@confluent.io>
>> wrote:
>> > > >
>> > > > > Optimizing startup seems really valuable but I'm a little
>> confused by
>> > > > this.
>> > > > >
>> > > > > There are two different things:
>> > > > > 1. Recovery
>> > > > > 2. Sanity check
>> > > > >
>> > > > > The terminology we're using is a bit mixed here.
>> > > > >
>> > > > > Recovery means checksumming the log segments and rebuilding the
>> index
>> > > on
>> > > > a
>> > > > > hard crash. This only happens on unflushed segments, which is
>> > generally
>> > > > > just the last segment. Recovery is essential for the correctness
>> > > > guarantees
>> > > > > of the log and you shouldn't disable it. It only happens on hard
>> > crash
>> > > > and
>> > > > > is not a factor in graceful restart. We can likely optimize it but
>> > that
>> > > > > would make most sense to do in a data driven fashion off some
>> > > profiling.
>> > > > >
>> > > > > However there is also a ton of disk activity that happens during
>> > > > > initialization (lots of checks on the file size, absolute path,
>> > etc). I
>> > > > > think these have crept in over time with people not really
>> realizing
>> > > this
>> > > > > code is perf sensitive and java hiding a lot of what is and isn't
>> a
>> > > file
>> > > > > operation. One part of this is the sanityCheck() call for the two
>> > > > indexes.
>> > > > > I don't think this call reads the full index, just the last entry
>> in
>> > > the
>> > > > > index, right?. There should be no need to read the full index
>> except
>> > > > during
>> > > > > recovery (and then only for the segments being recovered). I
>> think it
>> > > > would
>> > > > > make a ton of sense to optimize this but I don't think that
>> > > optimization
>> > > > > needs to be configurable as this is just a helpful sanity check to
>> > > detect
>> > > > > common non-sensical things in the index files, but it isn't part
>> of
>> > the
>> > > > > core guarantees, in general you aren't supposed to lose committed
>> > data
>> > > > from
>> > > > > disk, and if you do we may be able to fail faster but we
>> > fundamentally
>> > > > > can't really help you. Again I think this would make the most
>> sense
>> > to
>> > > do
>> > > > > in a data driven way, if you look at that code I think it is doing
>> > > crazy
>> > > > > amounts of file operations (e.g. getAbsolutePath, file sizes,
>> etc). I
>> > > > think
>> > > > > it'd make most sense to profile startup with a cold cash on a
>> large
>> > log
>> > > > > directory and do the same with an strace to see how many redundant
>> > > system
>> > > > > calls we do per segment and what is costing us and then cut some
>> of
>> > > this
>> > > > > out. I suspect we could speed up our startup time quite a lot if
>> we
>> > did
>> > > > > that.
>> > > > >
>> > > > > For example we have a bunch of calls like this:
>> > > > >
>> > > > >     require(len % entrySize == 0,
>> > > > >
>> > > > >             "Index file " + file.getAbsolutePath + " is corrupt,
>> > found
>> > > "
>> > > > +
>> > > > > len +
>> > > > >
>> > > > >             " bytes which is not positive or not a multiple of
>> 8.")
>> > > > > I'm pretty such file.getAbsolutePath is a system call and I assume
>> > that
>> > > > > happens whether or not you fail the in-memory check?
>> > > > >
>> > > > > -Jay
>> > > > >
>> > > > >
>> > > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <li...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > > I have created KIP-263: Allow broker to skip sanity check of
>> > inactive
>> > > > > > segments on broker startup. See
>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
>> > > > > > segments+on+broker+startup
>> > > > > > .
>> > > > > >
>> > > > > > This KIP provides a way to significantly reduce time to rolling
>> > > bounce
>> > > > a
>> > > > > > Kafka cluster.
>> > > > > >
>> > > > > > Comments are welcome!
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Dong
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] KIP-263: Allow broker to skip sanity check of inactive segments on broker startup

Posted by Dong Lin <li...@gmail.com>.

Hey Jason,

Thanks for the comment!

Your comment reminded me to read through Jay's comments and my reply again.
It seems that I probably have not captured idea of Jay's comment that says
sanity check is not part of any formal guarantee we provide. I probably
should have thought about this comment more. Let me reply to both yours and
Jay's comment and see if I can understand you better.

Here are some clarifications:
- KIP does not intend to optimize recovery. It aims to optimize the the
sanity check when there is clean shutdown.
- Sanity check only read the last entry of the index rather than the full
index
- We have already done data driven investigation though it is not done
using hprof or strace. The resulting rolling bounce time is acceptable now.
If it appears to be an issue e.g. after more data then we may need to
revisit this with more data driven investigation

I agree with the following comments:
- We should optimize the default behavior instead of adding a new config.
- sanity check of the segments before recovery offset is not part of any
formal guarantee and thus we probably can just skip it.

So we are all leaning towards skipping the sanity check of all segments
before the recovery offset. This solution would be pretty straightforward
to understand and implement. And I am sure it will give us all the benefits
that this KIP intends to achieve. Here is only one question to double check:

If consumer fetches from an inactive segment, broker will just use the
index of that inactive segment. If anything goes wrong, e.g. the index file
is corrupted or the index file does not exist, then the broker will just
consider it as IOException, mark the disk and the partitions on the disk
offline and respond KafkaStorageException to consumer. Does this sound OK?
One alternative solution is to let broker rebuild index. But this
alternative solution is inconsistent with the idea that "sanity check is not
part of any formal guarantee" and it may tie up all request handler thread
for rebuilding the indexed.


If this solution sounds right, I will update the KIP accordingly.

Thanks,
Dong

On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <ja...@confluent.io> wrote:

> Hey Dong,
>
> Sorry for being slow to catch up to this.
>
> I think the benefit of the sanity check seems a little dubious in the first
> place. We detect garbage at the end of the index file, but that's about it.
> Is there any reason to think that corruption is more likely to occur there
> or any other reason to think this check is still beneficial for flushed
> data? I assume we did the check because we presumed it was cheap, but
> perhaps the cost is adding up as the number of partitions grows. How much
> does startup time improve if we skip the sanity check for data earlier than
> the recovery point? Does the lazy loading itself give some additional
> benefit beyond skipping the sanity check? As Jay mentions above, the sanity
> checks seem strictly speaking optional. We don't bother checking the
> segments themselves for example.
>
> Thanks,
> Jason
>
>
>
>
> It probably still makes sense for segments beyond the recovery point
>
> On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <li...@gmail.com> wrote:
>
> > Hey Jay,
> >
> > Yeah our existing sanity check only read the last entry in the index
> files.
> > I must have miscommunicated if I previously said it was reading the full
> > index. Broker appears to be spending a lot of time just to read the last
> > entry of index files for every log segment. This is probably because OS
> > will load a chunk of data that is much larger than the entry itself from
> > disk to page cache. This KIP tries to make this part of operation lazy. I
> > guess you are suggesting that we should just make the lazy loading the
> > default behavior?
> >
> > Yes we currently require manual intervention if the log file is
> corrupted,
> > i.e. if two messages with the same offset are appended to the disk
> > (KAFKA-6488). The sanity check on broker startup is a bit different since
> > it deals with the corruption of index files (e.g. offset index, time
> index
> > and snapshot files) instead of the log data. In this case if index files
> > are corrupted broker will automatically recover it by rebuilding the
> index
> > files using data in the log files, without requiring manual intervention.
> > Thus the design question is whether this should be done before broker can
> > become leader for any partitions -- there is tradeoff between broker
> > startup time and risk of delaying user requests if broker need to rebuild
> > index files when it is already leader. I prefer lazy loading to reduce
> > broker startup time. Not sure what are the feedback from the community on
> > this issue.
> >
> > Thanks,
> > Dong
> >
> >
> > On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <ja...@confluent.io> wrote:
> >
> > > Hey Dong,
> > >
> > > Makes total sense. What I'm saying is I don't think that the sanity
> check
> > > is part of any formal guarantee we provide. It is true that corruption
> of
> > > data flushed to disk will be a potential problem, but I don't think the
> > > sanity check solves that it just has a couple heuristics to help detect
> > > certain possible instances of it, right? In general I think our
> > assumption
> > > has been that flushed data doesn't disappear or get corrupted and if it
> > > does you need to manually intervene. I don't think people want to
> > configure
> > > things at this level so what I was suggesting was understanding why the
> > > sanity check is slow and trying to avoid that rather than making it
> > > configurable. I think you mentioned it was reading the full index into
> > > memory. Based on the performance you describe this could be true, but
> it
> > > definitely should not be reading anything but the last entry in the
> index
> > > so that would be a bug. That read also happens in sanityCheck() only in
> > the
> > > time-based index right? In the offset index we do the same read but it
> > > happens in initialization. If that read is the slow thing it might make
> > > sense to try to remove it or make it lazy in both cases. If it is some
> > > other part of the code then (e.g. the size check) then that may be able
> > to
> > > be avoided entirely (I think by the time we sanity check we already
> know
> > > the file size from the mapping...). That was what I meant by doing some
> > > data driven analysis. Maybe a quick run with hprof would help determine
> > the
> > > root cause of why sanityCheck is slow?
> > >
> > > -Jay
> > >
> > > On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <li...@gmail.com> wrote:
> > >
> > > > Hey Jay,
> > > >
> > > > Thanks for your comments!
> > > >
> > > > Yeah recovery is different from the sanity check. They are correlated
> > in
> > > > the sense that there may still be corrupted index files even after
> > clean
> > > > broker shutdown. And in that case if we delay the sanity check then
> we
> > > may
> > > > delay the log recovery. The main goal of this KIP is to optimize the
> > > sanity
> > > > check related work so that it does not delay the broker startup much.
> > > >
> > > > The KIP mentioned that the sanity check is done using log recovery
> > > > background thread. The name "recovery" is mentioned mainly because
> the
> > > > background thread number is determined using the existing
> > > > config num.recovery.threads.per.data.dir. I have updated the KIP to
> > make
> > > > this less confusing.
> > > >
> > > > It makes a ton of sense to optimize the broker startup time in a data
> > > > driven fashion. The currently optimize is done kind of in this
> fashion.
> > > The
> > > > broker log shows that LogManager.loadLogs() takes a long time in
> large
> > > > clusters. Then I started broker with cold cache and repeatedly get
> > thread
> > > > dump to see what are broker threads are doing during
> > > LogManager.loadLogs().
> > > > Most of the threads are working on sanityCheck() and this motivates
> the
> > > > change in this KIP. Previously broker shutdown time was investigated
> > in a
> > > > similar data driven fashion and optimized with KAFKA-6172 and
> > KAFKA-6175.
> > > > It seems that the current KIP can reduces the rolling bounce time of
> a
> > > > large cluster by 50% -- there may be room for further improvement but
> > > maybe
> > > > those do not require as big a change (with the caveat described in
> the
> > > KIP)
> > > > as suggested in this KIP.
> > > >
> > > > It is not clear whether it is safe to just read the latest segment
> > > without
> > > > sanity checking all previous inactive segment of a given partition if
> > > > transaction is used. Otherwise we probably want to always skip the
> > sanity
> > > > check of inactive segments without introducing a new config. Maybe
> the
> > > > developers familiar with the transaction can comment on that?
> > > >
> > > > Thanks,
> > > > Dong
> > > >
> > > >
> > > > On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <ja...@confluent.io> wrote:
> > > >
> > > > > Optimizing startup seems really valuable but I'm a little confused
> by
> > > > this.
> > > > >
> > > > > There are two different things:
> > > > > 1. Recovery
> > > > > 2. Sanity check
> > > > >
> > > > > The terminology we're using is a bit mixed here.
> > > > >
> > > > > Recovery means checksumming the log segments and rebuilding the
> index
> > > on
> > > > a
> > > > > hard crash. This only happens on unflushed segments, which is
> > generally
> > > > > just the last segment. Recovery is essential for the correctness
> > > > guarantees
> > > > > of the log and you shouldn't disable it. It only happens on hard
> > crash
> > > > and
> > > > > is not a factor in graceful restart. We can likely optimize it but
> > that
> > > > > would make most sense to do in a data driven fashion off some
> > > profiling.
> > > > >
> > > > > However there is also a ton of disk activity that happens during
> > > > > initialization (lots of checks on the file size, absolute path,
> > etc). I
> > > > > think these have crept in over time with people not really
> realizing
> > > this
> > > > > code is perf sensitive and java hiding a lot of what is and isn't a
> > > file
> > > > > operation. One part of this is the sanityCheck() call for the two
> > > > indexes.
> > > > > I don't think this call reads the full index, just the last entry
> in
> > > the
> > > > > index, right?. There should be no need to read the full index
> except
> > > > during
> > > > > recovery (and then only for the segments being recovered). I think
> it
> > > > would
> > > > > make a ton of sense to optimize this but I don't think that
> > > optimization
> > > > > needs to be configurable as this is just a helpful sanity check to
> > > detect
> > > > > common non-sensical things in the index files, but it isn't part of
> > the
> > > > > core guarantees, in general you aren't supposed to lose committed
> > data
> > > > from
> > > > > disk, and if you do we may be able to fail faster but we
> > fundamentally
> > > > > can't really help you. Again I think this would make the most sense
> > to
> > > do
> > > > > in a data driven way, if you look at that code I think it is doing
> > > crazy
> > > > > amounts of file operations (e.g. getAbsolutePath, file sizes,
> etc). I
> > > > think
> > > > > it'd make most sense to profile startup with a cold cash on a large
> > log
> > > > > directory and do the same with an strace to see how many redundant
> > > system
> > > > > calls we do per segment and what is costing us and then cut some of
> > > this
> > > > > out. I suspect we could speed up our startup time quite a lot if we
> > did
> > > > > that.
> > > > >
> > > > > For example we have a bunch of calls like this:
> > > > >
> > > > >     require(len % entrySize == 0,
> > > > >
> > > > >             "Index file " + file.getAbsolutePath + " is corrupt,
> > found
> > > "
> > > > +
> > > > > len +
> > > > >
> > > > >             " bytes which is not positive or not a multiple of 8.")
> > > > > I'm pretty such file.getAbsolutePath is a system call and I assume
> > that
> > > > > happens whether or not you fail the in-memory check?
> > > > >
> > > > > -Jay
> > > > >
> > > > >
> > > > > On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <li...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I have created KIP-263: Allow broker to skip sanity check of
> > inactive
> > > > > > segments on broker startup. See
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > > 263%3A+Allow+broker+to+skip+sanity+check+of+inactive+
> > > > > > segments+on+broker+startup
> > > > > > .
> > > > > >
> > > > > > This KIP provides a way to significantly reduce time to rolling
> > > bounce
> > > > a
> > > > > > Kafka cluster.
> > > > > >
> > > > > > Comments are welcome!
> > > > > >
> > > > > > Thanks,
> > > > > > Dong
> > > > > >
> > > > >
> > > >
> > >
> >
>