Posted to users@kafka.apache.org by Neil Harkins <nh...@gmail.com> on 2014/11/06 21:56:06 UTC

nulls found in topic, created by recovery?

Hi all. I saw something weird yesterday on our "leaf" instances
which run kafka 0.7.2 (and mirror to kafka 0.8 via our custom code).
I fully realize everyone's instinctual response is "upgrade, already",
but I'd like to have an internals discussion to better understand
what happened, as I suspect it's still relevant in 0.8.

Basically, in one of our topics there was an 8k stretch of null bytes.
Correlating timestamps from the messages bracketing the nulls
against the kafka server log, I can see that the broker restarted during
that window. Here are the recovery lines related to the topic with the nulls:

[2014-11-04 00:48:07,602] INFO zookeeper state changed (SyncConnected)
(org.I0Itec.zkclient.ZkClient)
[2014-11-04 01:00:35,806] INFO Shutting down Kafka server
(kafka.server.KafkaServer)
[2014-11-04 01:00:35,813] INFO shutdown scheduler kafka-logcleaner-
(kafka.utils.KafkaScheduler)
[2014-11-04 01:01:38,411] INFO Starting Kafka server...
(kafka.server.KafkaServer)
...
[2014-11-04 01:01:49,146] INFO Loading log 'foo.bar-0' (kafka.log.LogManager)
[2014-11-04 01:01:49,147] INFO Loading the last segment
/var/kafka-leaf-spool/foo.bar-0/00000000002684355423.kafka in mutable
mode, recovery true (kafka.log.Log)
[2014-11-04 01:01:55,877] INFO recover high water mark:414004449
(kafka.message.FileMessageSet)
[2014-11-04 01:01:55,877] INFO Recovery succeeded in 6 seconds. 0
bytes truncated. (kafka.message.FileMessageSet)

The only hypothesis I can come up with is that the shutdown
("graceful"?) did not wait for all messages to flush to disk
(we're configured with: log.flush.interval=10000,
log.default.flush.interval.ms=500, and
log.default.flush.scheduler.interval.ms=500).
The max offset had already been recorded, though, so when the broker
came back up it filled the gap with nulls to reach that valid max offset,
presumably so that any consumers already at the end would keep a
consistent position.
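
For reference, here is how those settings sit in our 0.7 server.properties,
with my understanding of what each one controls (the comments are mine):

# force a flush after this many messages accumulate in a log partition
log.flush.interval=10000
# ...or after a message has sat unflushed for this long (ms), whichever is first
log.default.flush.interval.ms=500
# how often (ms) the background flusher checks whether a flush is due
log.default.flush.scheduler.interval.ms=500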

But are consumers positioned prior to all the nulls guaranteed to get
back "on the rails", so to speak? The nulls parse as v0 (i.e. "magic"
byte 0) messages of zero length, but the messages they replaced could
have been variable length.

Thanks in advance for any input,
-neil

Re: nulls found in topic, created by recovery?

Posted by Neil Harkins <nh...@gmail.com>.
The topic is not compressed. The consumer used our fork of the python lib,
which I had to modify to get over the nulls.

-neil


On Thu, Nov 6, 2014 at 2:16 PM, Neha Narkhede <ne...@gmail.com> wrote:
> IIRC, the bug that introduced the nulls was related to compressed data. Is
> this topic compressed? Did you try to run a consumer through the topic's
> data or alternately the DumpLogSegments tool?

Re: nulls found in topic, created by recovery?

Posted by Neha Narkhede <ne...@gmail.com>.
IIRC, the bug that introduced the nulls was related to compressed data. Is
this topic compressed? Did you try to run a consumer through the topic's
data or alternately the DumpLogSegments tool?
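
Something along these lines, if it helps (flags as in the 0.8 tree; I'm not
sure the 0.7 version of the tool takes the same arguments):

bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /var/kafka-leaf-spool/foo.bar-0/00000000002684355423.kafka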
