Posted to users@kafka.apache.org by Jason Rosenberg <jb...@squareup.com> on 2014/11/06 15:31:09 UTC

corrupt recovery checkpoint file issue....

Hi,

We recently had a Kafka node go down suddenly. When it came back up, it
apparently had a corrupt recovery checkpoint file and refused to start up:

2014-11-06 08:17:19,299  WARN [main] server.KafkaServer - Error
starting up KafkaServer
java.lang.NumberFormatException: For input string:
"^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:481)
        at java.lang.Integer.parseInt(Integer.java:527)
        at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
        at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
        at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
        at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
        at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
        at kafka.log.LogManager.loadLogs(LogManager.scala:105)
        at kafka.log.LogManager.<init>(LogManager.scala:57)
        at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
        at kafka.server.KafkaServer.startup(KafkaServer.scala:72)

And since the app runs under a process monitor, it was repeatedly restarting
and failing with this error for several minutes before we got to it…

We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and it
then restarted cleanly (but of course re-synced all of its data from
replicas, so we had no data loss).

Anyway, I’m wondering if that’s the expected behavior? Or should it not
declare it corrupt and then proceed automatically to an unclean restart?

Should this NumberFormatException be handled a bit more gracefully?

We saved the corrupt file if it’s worth inspecting (although I doubt it
will be useful!)….

Jason

Re: corrupt recovery checkpoint file issue....

Posted by Guozhang Wang <wa...@gmail.com>.
Jason,

Yes, I agree with you. We should handle this more gracefully, since writing the
checkpoint file is not guaranteed to be atomic. Could you file a JIRA?
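
Roughly what I have in mind (just a sketch, not the actual OffsetCheckpoint
code; the wrapper name and the plain (topic, partition) key type are made up
for illustration):

    import java.io.File

    // Hypothetical wrapper around the existing parsing logic: if the checkpoint
    // cannot be parsed, log a warning and return an empty map so the broker
    // falls back to recovering the logs instead of failing startup.
    def readCheckpointSafely(file: File,
                             doRead: File => Map[(String, Int), Long]): Map[(String, Int), Long] = {
      try {
        doRead(file)
      } catch {
        case e: NumberFormatException =>
          System.err.println("Checkpoint file %s appears corrupt (%s); ignoring it"
            .format(file.getAbsolutePath, e.getMessage))
          Map.empty
      }
    }

Whether falling back silently like this is the right behavior, or whether the
broker should refuse to start with a clearer error message, is something we
can discuss on the JIRA.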

Guozhang


Re: corrupt recovery checkpoint file issue....

Posted by Guozhang Wang <wa...@gmail.com>.
You are right. The swap will be skipped in that case. It seems this
mechanism does not prevent corruption when the storage system crashes hard,
though.

An orthogonal note: I originally thought renameTo on Linux is atomic, but
after reading some JavaDocs I think maybe we should use
java.nio.file.Files.move to be safer?

https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#move%28java.nio.file.Path,%20java.nio.file.Path,%20java.nio.file.CopyOption...%29
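
Something like this is what I mean (just a sketch; the method name is made up
and the fallback branch is one possible choice):

    import java.nio.file.{AtomicMoveNotSupportedException, Files, Path, StandardCopyOption}

    // Sketch of the swap step using NIO: request an atomic move, and fall back
    // to a plain replace only if the platform/filesystem cannot do it atomically.
    def swapIntoPlace(tmp: Path, target: Path): Unit = {
      try {
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
      } catch {
        case _: AtomicMoveNotSupportedException =>
          Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING)
      }
    }

Unlike renameTo, Files.move signals failure with an exception rather than a
boolean return value, so a failed swap cannot be silently ignored.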

Guozhang


Re: corrupt recovery checkpoint file issue....

Posted by Jun Rao <ju...@gmail.com>.
Guozhang,

In OffsetCheckpoint.write(), we don't catch any exceptions. There is only a
finally clause to close the writer. So, if there is any exception during the
write or the sync, the exception will be propagated back to the caller and
the swap will be skipped.
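
In simplified form, the write path looks roughly like this (a sketch with
made-up names and a simplified file format, not the actual OffsetCheckpoint
code):

    import java.io.{BufferedWriter, File, FileOutputStream, IOException, OutputStreamWriter}

    // Sketch: write everything to a temp file, fsync it, and only then rename
    // it over the old checkpoint. Any exception thrown while writing or syncing
    // propagates out before the rename, so the previous checkpoint stays
    // intact; the finally block only closes the writer.
    def writeCheckpoint(dir: File, offsets: Map[(String, Int), Long]): Unit = {
      val temp   = new File(dir, "recovery-point-offset-checkpoint.tmp")
      val fos    = new FileOutputStream(temp)
      val writer = new BufferedWriter(new OutputStreamWriter(fos))
      try {
        offsets.foreach { case ((topic, partition), offset) =>
          writer.write("%s %d %d".format(topic, partition, offset))
          writer.newLine()
        }
        writer.flush()
        fos.getFD.sync()   // a SyncFailedException here also propagates to the caller
      } finally {
        writer.close()
      }
      // Only reached if writing and syncing succeeded.
      val target = new File(dir, "recovery-point-offset-checkpoint")
      if (!temp.renameTo(target))
        throw new IOException("Failed to swap %s into place".format(temp.getAbsolutePath))
    }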

Thanks,

Jun


Re: corrupt recovery checkpoint file issue....

Posted by Guozhang Wang <wa...@gmail.com>.
Jun,

Checking the OffsetCheckpoint.write function, if
"fileOutputStream.getFD.sync" throws an exception, it will just be caught
and forgotten, and the swap will still happen. Maybe we need to catch the
SyncFailedException and re-throw it as a FATAL error to skip the swap.

Guozhang



Re: corrupt recovery checkpoint file issue....

Posted by Jason Rosenberg <jb...@squareup.com>.
filed: https://issues.apache.org/jira/browse/KAFKA-1758


Re: corrupt recovery checkpoint file issue....

Posted by Jason Rosenberg <jb...@squareup.com>.
I'm still not sure what caused the reboot of the system (but yes, it appears
to have crashed hard). The file system is xfs, on CentOS Linux. I'm not yet
sure, but I think the system might also have become wedged before the crash.

It appears the corrupt recovery file actually contained all zero bytes,
after looking at it with od.

I'll file a Jira.


Re: corrupt recovery checkpoint file issue....

Posted by Jun Rao <ju...@gmail.com>.
I am also wondering how the corruption happened. The way that we update the
OffsetCheckpoint file is to first write to a tmp file and flush the data.
We then rename the tmp file to the final file. This is done to prevent
corruption caused by a crash in the middle of the writes. In your case, did
the host crash? What kind of storage system are you using? Is there any
non-volatile cache on the storage system?

Thanks,

Jun


Re: corrupt recovery checkpoint file issue....

Posted by Jason Rosenberg <jb...@squareup.com>.
Forgot to mention, we are using 0.8.1.1....

Jason
