Posted to users@kafka.apache.org by Jason Rosenberg <jb...@squareup.com> on 2013/05/07 00:07:13 UTC

slow log recovery

Recently, we had an issue where our Kafka brokers were shut down hard (and
so did not write out the clean shutdown file).  Thus, on restart, each broker
went through all of its logs and ran a recovery on them.

Unfortunately, this took a long time (on the order of 30 minutes).  We have
a lot of topics (~1000 or so).  Is there any way this can be done more
quickly, say in parallel?
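
To make the "in parallel" idea concrete, here is a minimal sketch, not Kafka's actual recovery code. It assumes each log lives in its own directory and can be recovered independently of the others. The names `recover_log` and `recover_all` are hypothetical, and the "recovery" here only reads each segment file and counts its bytes.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict
import tempfile


def recover_log(log_dir: Path) -> int:
    """Hypothetical stand-in for per-log recovery: read every segment, return bytes scanned."""
    total = 0
    for segment in sorted(log_dir.glob("*.log")):
        total += len(segment.read_bytes())
    return total


def recover_all(data_dir: Path, workers: int = 4) -> Dict[str, int]:
    """Recover every log directory under data_dir using a pool of worker threads."""
    log_dirs = sorted(p for p in data_dir.iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each log is independent, so the scans can overlap instead of running serially.
        results = pool.map(recover_log, log_dirs)
        return {d.name: n for d, n in zip(log_dirs, results)}
```

Whether this helps in practice depends on where the time goes: if all logs share one disk, parallel scans may just compete for the same I/O.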

Also, could it be done as a background process, so that the server can start
up and start receiving messages, logs for incoming topics are prioritized in
the recovery process, and perhaps messages can still be buffered in memory
while the log recovery is happening?
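
The prioritization part of this could look roughly like the sketch below. It is only an illustration of the idea: `RecoveryQueue`, `prioritize`, and `next_topic` are hypothetical names, and nothing here reflects how the broker is actually structured. A background worker would call `next_topic()` in a loop, and the request path would call `prioritize()` when a message arrives for a topic that has not been recovered yet.

```python
import heapq
import itertools


class RecoveryQueue:
    """Orders pending log recoveries; topics with incoming traffic jump the line."""

    def __init__(self, topics):
        self._counter = itertools.count()  # tie-breaker keeps insertion order stable
        self._pending = set(topics)
        self._heap = []
        for topic in topics:
            heapq.heappush(self._heap, (1, next(self._counter), topic))

    def prioritize(self, topic):
        """Called when a message arrives for a topic that is still unrecovered."""
        if topic in self._pending:
            heapq.heappush(self._heap, (0, next(self._counter), topic))

    def next_topic(self):
        """Return the next topic to recover, or None when nothing is pending."""
        while self._heap:
            _, _, topic = heapq.heappop(self._heap)
            if topic in self._pending:  # skip stale entries for recovered topics
                self._pending.discard(topic)
                return topic
        return None
```

For example, with topics `a`, `b`, `c` queued and a message arriving for `c`, the worker would recover `c` first and then fall back to `a` and `b`.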

It seems onerous to block all activity for 30 minutes while a slow, serial
recovery job runs.

Thoughts?

Jason

Re: slow log recovery

Posted by Jason Rosenberg <jb...@squareup.com>.
That will work, Jun (I guess it's no different from the current situation
with 0.7.x).

(And I still think it should be possible to recover logs in parallel!).

Jason


On Tue, May 7, 2013 at 7:55 AM, Jun Rao <ju...@gmail.com> wrote:

> If a broker is down, the cluster will run in under-replicated mode,
> i.e., data will be written to fewer replicas. When the broker comes back, it
> will catch up on data from the current leader.
>
> Thanks,
>
> Jun
>

Re: slow log recovery

Posted by Jun Rao <ju...@gmail.com>.
If a broker is down, the cluster will run in under-replicated mode,
i.e., data will be written to fewer replicas. When the broker comes back, it
will catch up on data from the current leader.

Thanks,

Jun


On Mon, May 6, 2013 at 10:23 PM, Jason Rosenberg <jb...@squareup.com> wrote:

> Will producers also be able to start sending new messages to a replica
> while one broker is taking a long time to start up?

Re: slow log recovery

Posted by Jason Rosenberg <jb...@squareup.com>.
Will producers also be able to start sending new messages to a replica
while one broker is taking a long time to start up?


On Mon, May 6, 2013 at 9:31 PM, Jun Rao <ju...@gmail.com> wrote:

> In 0.8, if you turn on replication, it may not matter too much if a broker
> takes long to start up since data can still be served from the replicas. It
> may be possible to improve this by maintaining a flush checkpoint file on
> disk. We can then use that info to reduce the amount of the data to be
> recovered.
>
> Thanks,
>
> Jun

Re: slow log recovery

Posted by Jun Rao <ju...@gmail.com>.
In 0.8, if you turn on replication, it may not matter too much if a broker
takes long to start up since data can still be served from the replicas. It
may be possible to improve this by maintaining a flush checkpoint file on
disk. We can then use that info to reduce the amount of the data to be
recovered.
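
As a rough illustration of the checkpoint idea (a sketch under assumptions, not an existing Kafka feature or file format): suppose the broker periodically writes, per log, the offset up to which data is known to be flushed. After a hard shutdown, only offsets at or beyond that point would need to be re-checked. The function names and the JSON layout below are hypothetical.

```python
import json
from pathlib import Path
import tempfile


def write_checkpoint(path: Path, flushed: dict) -> None:
    """Persist {log name: last flushed offset}; write to a temp file, then rename over the old one."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(flushed))
    tmp.replace(path)  # avoids leaving a half-written checkpoint behind


def read_checkpoint(path: Path) -> dict:
    """Return the saved offsets, or an empty dict if no checkpoint exists yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def offsets_to_recover(log_end_offsets: dict, checkpoint: dict) -> dict:
    """For each log, the range of offsets that still needs recovery after a crash."""
    return {
        log: range(checkpoint.get(log, 0), end)  # no checkpoint entry: recover from 0
        for log, end in log_end_offsets.items()
    }
```

With a checkpoint of 90 for a log whose end offset is 100, only 10 offsets would be scanned instead of 100; a log with no checkpoint entry is still recovered in full.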

Thanks,

Jun


On Mon, May 6, 2013 at 3:07 PM, Jason Rosenberg <jb...@squareup.com> wrote:

> Recently, we had an issue where our Kafka brokers were shut down hard (and
> so did not write out the clean shutdown file).  Thus, on restart, each broker
> went through all of its logs and ran a recovery on them.
>
> Unfortunately, this took a long time (on the order of 30 minutes).  We have
> a lot of topics (~1000 or so).  Is there any way this can be done more
> quickly, say in parallel?
>
> Also, could it be done as a background process, so that the server can start
> up and start receiving messages, logs for incoming topics are prioritized in
> the recovery process, and perhaps messages can still be buffered in memory
> while the log recovery is happening?
>
> It seems onerous to block all activity for 30 minutes while a slow, serial
> recovery job runs.
>
> Thoughts?
>
> Jason
>