Posted to users@kafka.apache.org by Taylor Gautier <tg...@tagged.com> on 2011/11/18 06:03:04 UTC

the cleaner and log segments

Hi,

We've noticed that the cleaner in Kafka removes empty log segments
but not the directories themselves.  I am actually wondering something - I
had always assumed that Kafka restores the latest offset for existing
topics by scanning the log directory for topic directories, and scanning
each of those for log segment files to recover the latest offset.

I reached this conclusion purely by observation, so it could be entirely
wrong.

My question, however, is this: if I am right, and the cleaner removes all
the log segments for a given topic so that the topic directory is empty,
how does Kafka behave when restarted?  How does it know what the next
offset should be?

Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
In the broker, the name of each log segment file is the offset of the first
message in that file, so the last offset can be computed as filename +
file length.

Jun
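
To make that concrete, here is a minimal sketch - illustrative only, not the
actual broker code - of recovering the next offset from a partition directory,
assuming the 0.6/0.7 model where offsets are byte positions and each segment
file is named for the offset of its first message:

import java.io.File

// Minimal sketch (not actual Kafka code): recover the next offset for a
// partition directory from its segment files. Segment names are zero-padded
// byte offsets, e.g. 00000000000000016226.kafka.
def nextOffset(partitionDir: File): Long = {
  val segments = Option(partitionDir.listFiles)
    .getOrElse(Array.empty[File])
    .filter(_.getName.endsWith(".kafka"))
  if (segments.isEmpty)
    0L // no segments at all: the ambiguous case discussed later in this thread
  else {
    val last = segments.maxBy(_.getName.stripSuffix(".kafka").toLong)
    last.getName.stripSuffix(".kafka").toLong + last.length
  }
}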

On Fri, Nov 18, 2011 at 8:52 AM, Taylor Gautier <tg...@tagged.com> wrote:

> Right. I'm talking about the broker. Where does it store what is the
> most recent offset if there are no log segments?  And no ZK.
>
> On Nov 18, 2011, at 8:50 AM, Jun Rao <ju...@gmail.com> wrote:
>
> > What I described is what happens in the broker. If you use SimpleConsumer,
> > then it's the consumer's responsibility to remember the last offset; the
> > server doesn't store consumer state.
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Nov 18, 2011 at 8:19 AM, Taylor Gautier <tg...@tagged.com> wrote:
> >
> >> how?  where is the information kept?  If ZK is not around, and it's not
> >> on disk, how is this information passed to the next process after the
> >> restart?
> >>
> >> On Fri, Nov 18, 2011 at 8:04 AM, Jun Rao <ju...@gmail.com> wrote:
> >>
> >>> 4) is incorrect. The "last offset" remains 'a' even after the data is
> >>> cleaned. So in 5), the offset will be 2 x 'a'. That is, we never recycle
> >>> offsets; they keep increasing.
> >>>
> >>> Thanks,
> >>>
> >>> Jun
> >>>
> >>> On Fri, Nov 18, 2011 at 7:02 AM, Taylor Gautier <tg...@tagged.com> wrote:
> >>>
> >>>> I don't use high level consumers - just low level.  What I was thinking
> >>>> was the following.  Let's assume I have turned off ZK in my setup.
> >>>>
> >>>> 1) Send 1 message to topic A.  Kafka creates a directory and log segment
> >>>> for A.  The log segment starts at 0.  Now, the "last offset" of the
> >>>> topic is a.
> >>>>
> >>>> 2) A consumer reads the message from topic A, and records that the most
> >>>> recent offset in topic A is a.
> >>>>
> >>>> 3) Much time passes, the cleaner runs, and deletes the log segment.
> >>>>
> >>>> 4) More time passes, and I restart Kafka.  Kafka sees the topic A
> >>>> directory, but has no segment file to initialize from.  So the "last
> >>>> offset" is considered to be 0.
> >>>>
> >>>> 5) Send 1 message to topic A.  Kafka creates a log segment for A
> >>>> starting at 0.  The new last offset of the topic is a'.
> >>>>
> >>>> 6) The consumer from step 2 tries to read from Kafka at offset a, but
> >>>> this is now an invalid offset.
> >>>>
> >>>> Does that sound right?  I haven't tried this yet; I'm just doing a
> >>>> thought experiment here to try to figure out what would happen.
> >>>>
> >>>> On Thu, Nov 17, 2011 at 11:01 PM, Jun Rao <ju...@gmail.com> wrote:
> >>>>
> >>>>> This is true for the high-level ZK-based consumer.
> >>>>>
> >>>>> Jun
> >>>>>
> >>>>> On Thu, Nov 17, 2011 at 10:59 PM, Inder Pall <in...@gmail.com> wrote:
> >>>>>
> >>>>>> Jun & Taylor,
> >>>>>> would it be right to say that consumers without ZK won't be a viable
> >>>>>> option if you can't handle replay of old messages in your application?
> >>>>>>
> >>>>>> - inder
> >>>>>>
> >>>>>> On Fri, Nov 18, 2011 at 12:27 PM, Jun Rao <ju...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Taylor,
> >>>>>>>
> >>>>>>> When you start a consumer, it always tries to get the last
> >>>>>>> checkpointed offset from ZK. If no offset can be found in ZK, the
> >>>>>>> consumer starts from either the smallest or the largest available
> >>>>>>> offset in the broker.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Jun
> >>>>>>>
> >>>>>>> On Thu, Nov 17, 2011 at 9:20 PM, Taylor Gautier <tgautier@tagged.com> wrote:
> >>>>>>>
> >>>>>>>> hmmm - and if you turn off zookeeper?
> >>>>>>>>
> >>>>>>>> On Thu, Nov 17, 2011 at 9:15 PM, Inder Pall <inder.pall@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> The consumer offsets are stored in ZooKeeper by topic and partition.
> >>>>>>>>> That's how, in a consumer failover scenario, you don't get
> >>>>>>>>> duplicate messages.
> >>>>>>>>>
> >>>>>>>>> - Inder
> >>>>>>>>>
> >>>>>>>>> On Fri, Nov 18, 2011 at 10:33 AM, Taylor Gautier <tgautier@tagged.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> [snip - Taylor's original question, quoted in full at the top of
> >>>>>>>>>> this thread]

Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
Hmm…it *definitely* does not work right in 0.6.  We actually take advantage
of it to clean up dead topics.  Our current use case is very different from
what Kafka was designed for - we have hundreds of thousands of topics that
individually get very little traffic.

As you can surmise, not creating topics on read (KAFKA-101) was a very
important feature for this use case.

On Wed, Nov 23, 2011 at 9:40 AM, Jun Rao <ju...@gmail.com> wrote:

> Yes.
>
> Jun
>
> On Wed, Nov 23, 2011 at 8:22 AM, Chris Burroughs <ch...@gmail.com> wrote:
>
> > Was that "write an empty log segment" feature always there?

Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
Yes.

Jun

On Wed, Nov 23, 2011 at 8:22 AM, Chris Burroughs <ch...@gmail.com> wrote:

> Was that "write an empty log segment" feature always there?

Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
At some point I will - right now we are unfortunately stuck with 0.6.  The
biggest problem is that the message format of the binary protocol changed,
and I would have to upgrade my clients just to try it out.

On Sat, Nov 19, 2011 at 9:17 AM, Jun Rao <ju...@gmail.com> wrote:

> Could you try the 0.7 RC?

Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
Could you try the 0.7 RC?

Thanks,

Jun


Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
Oh - well if that's what is supposed to happen, it's not. I don't
think the failure is due to a race condition; it seems intentional that it
just removes the segment file without creating anything, because the
behavior is quite consistent.

Note that I'm using 0.6.


Re: the cleaner and log segments

Posted by Chris Burroughs <ch...@gmail.com>.
Was that "write an empty log segment" feature always there?



Re: the cleaner and log segments

Posted by Joel Koshy <jj...@gmail.com>.
Just want to see if I understand this right - when the log cleaner
does its thing, even if all the segments are eligible for garbage
collection, the cleaner will nuke those files and should deposit an
empty segment file named with the next valid offset in that partition.
I think Taylor encountered a case where that empty segment was not
added. Is this the race condition that you speak of? If, for example,
the broker crashes before that empty segment file is created...

Also, I have seen the log cleaner act up more than once in the past -
it basically seems to get scheduled continuously and delete file 0000...
I think someone else on the list saw that before. I have been unable
to reproduce that, though, and it is not impossible that there was a
misconfiguration at play.

Thanks,

Joel
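
In rough code, the cleanup step Joel describes would look something like the
sketch below. This is illustrative only - not the actual 0.6/0.7 cleaner
source - and Taylor's reports suggest 0.6 skips the final step:

import java.io.File

// Sketch of the behavior described above (illustrative, not Kafka source):
// delete expired segments, and if the partition is left empty, drop in a
// zero-length segment named with the next offset so the last offset
// survives the cleanup.
def cleanPartition(dir: File, expired: Seq[File], nextOffset: Long): Unit = {
  expired.foreach(_.delete())
  val remaining = Option(dir.listFiles)
    .getOrElse(Array.empty[File])
    .filter(_.getName.endsWith(".kafka"))
  if (remaining.isEmpty) {
    // If the broker crashes before this step runs (or the step is missing,
    // as in 0.6), a restart finds an empty directory and offsets restart at 0.
    new File(dir, "%020d.kafka".format(nextOffset)).createNewFile()
  }
}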


Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
Ok, that's what we are already doing.  In essence, when that happens it
is a bit like a rollover. Except that, depending on the values, a consumer
may hold an offset low enough that when it requests the topic, the offset
is still within range but no longer valid, since new messages were
delivered to the broker in the meantime. Essentially it's a race condition
that might be somewhat hard to induce but is theoretically possible. With
a true 64-bit rollover this will more or less never happen, because 64 bits
is just too large to open a time window long enough for the race to occur.
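
To put rough numbers on that race (the figures below are made up, and assume
the 0.6 model where offsets are byte positions and a cleaned topic restarts
at 0):

// Hypothetical numbers for the race described above (0.6 byte offsets).
val savedOffset = 500L   // consumer's saved offset; the old log held 1000 bytes
// The cleaner wipes the log, offsets restart at 0 (the 0.6 behavior), and
// 800 bytes of new messages arrive before the consumer's next fetch.
val newLogEnd   = 800L
val passesCheck = savedOffset < newLogEnd // true: 500 is "in range" again
// But nothing guarantees byte 500 is a message boundary in the new log, so
// the fetch is not rejected - it just returns garbage. With a 64-bit
// rollover the window for this to line up is effectively never open.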

Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
Taylor,

If you request an offset whose corresponding log file has been deleted, you
will get an OffsetOutOfRangeException. When this happens, you can use the
getLatestOffset api in SimpleConsumer to obtain either the smallest or the
largest currently valid offset and re-consume from there. Our high-level
consumer does that for you (among many other things). That's why we
encourage most users to use the high-level api instead.

Thanks,

Jun
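
That recovery pattern looks roughly like the sketch below. The broker calls
are fake stand-ins so the example is self-contained; they are not the
literal SimpleConsumer signatures:

object OutOfRangeRecovery {
  class OffsetOutOfRange extends RuntimeException

  // Stand-in for a SimpleConsumer fetch: offsets below the smallest valid
  // offset (e.g. after the cleaner removed older segments) are rejected.
  val smallestValid = 1000L
  def fetchBatch(offset: Long): Seq[String] =
    if (offset < smallestValid) throw new OffsetOutOfRange
    else Seq(s"message at $offset")

  def fetchWithRecovery(offset: Long): Seq[String] =
    try fetchBatch(offset)
    catch {
      case _: OffsetOutOfRange =>
        // Re-anchor: the smallest valid offset replays everything the broker
        // still has; re-anchoring at the largest would skip to new data only.
        fetchBatch(smallestValid)
    }

  def main(args: Array[String]): Unit =
    println(fetchWithRecovery(100L)) // 100 is out of range; recovers at 1000
}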

Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
Well, interestingly enough, I just checked the logs, and the problem I
thought might happen already did.  Here it is:

[2011-11-18 09:31:52,255] INFO Deleting log segment
00000000000000016226.kafka from cards_card_1185934476-0
(kafka.log.LogManager)
[2011-11-18 09:31:52,255] WARN Delete failed. (kafka.log.LogManager)
[2011-11-18 09:31:52,255] INFO Deleting log segment
00000000000000000026.kafka from healthCheck1320643480188-0
(kafka.log.LogManager)
[2011-11-18 09:31:52,255] INFO Deleting log segment
00000000000000000028.kafka from healthCheck1319860947508-0
(kafka.log.LogManager)
[2011-11-18 09:31:52,255] ERROR error when processing request
topic:cards_card_1185934476, part:0 offset:16226 maxSize:1048576
kafka.common.OffsetOutOfRangeException: offset 16226 is out of range
        at kafka.log.Log$.findRange(Log.scala:47)
        at kafka.log.Log.read(Log.scala:223)
        at kafka.server.KafkaRequestHandlers.kafka$server$KafkaRequestHandlers$$readMessageSet(KafkaRequestHandlers.scala:125)
        at kafka.server.KafkaRequestHandlers.handleFetchRequest(KafkaRequestHandlers.scala:107)
        at kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$2.apply(KafkaRequestHandlers.scala:42)
        at kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$2.apply(KafkaRequestHandlers.scala:42)
        at kafka.network.Processor.handle(SocketServer.scala:268)
        at kafka.network.Processor.read(SocketServer.scala:291)
        at kafka.network.Processor.run(SocketServer.scala:202)
        at java.lang.Thread.run(Thread.java:619)
 (kafka.server.KafkaRequestHandlers)


You see the issue?  The consumer had previously read messages up to offset
16226.  The cleaner came and removed the segment from the directory, so
there are no more segments.  The consumer then came back and asked for
offset 16226, and it is now invalid.  I had previously thought this might
occur only after a restart, but it appears to happen even without one.



Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
Right. I'm talking about the broker. Where does it store what is the
most recent offset if there are no log segments?  And no ZK.




Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
What I described is what happens in the broker. If you use SimpleConsumer,
then it's the consumer's responsibility to remember the last offset; the
server doesn't store consumer state.

Thanks,

Jun


Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
how?  where is the information kept?  If ZK is not around, and it's not on
disk, how is this information passed to the next process after the restart?


Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
Evan,

We don't roll offsets back at the moment. Since the offset is a long, it
can last for a really long time: if you write 1TB a day, you can keep going
for millions of days.

Plus, you can always use more partitions (each partition has its own
offset).

Thanks,

Jun
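
As a back-of-envelope check - assuming offsets are byte positions stored in
a signed 64-bit long, as in 0.6/0.7 - the headroom is indeed on the order of
millions of days:

// Rough longevity of a signed 64-bit byte offset at 1 TB/day of writes.
val maxOffset   = Long.MaxValue              // 2^63 - 1, about 9.2e18 bytes
val bytesPerDay = 1000L * 1000 * 1000 * 1000 // 1 TB/day
println(maxOffset / bytesPerDay)             // about 9.2 million days (~25,000 years)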


Re: the cleaner and log segments

Posted by Evan Chan <ev...@ooyala.com>.
Jun,

How do offsets keep increasing?  Eventually they have to roll over back to
0, right?  What happens if Kafka runs for months and the offset eventually
rolls over?


Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
4) is incorrect. The "last offset" remains 'a' even after the data is
cleaned, so in 5) the new last offset will be 2 x 'a'. That is, we never
recycle offsets; they keep increasing.
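
A toy model of the bookkeeping (a sketch to illustrate the invariant
only - this is not the broker code, and it pretends each segment holds
exactly one fixed-size message):

import java.util.ArrayDeque;
import java.util.Deque;

public class ToyLog {
    private final Deque<Long> segmentBases = new ArrayDeque<Long>();
    private long nextOffset = 0;    // monotonically increasing, never reset
    private final long messageSize;

    ToyLog(long messageSize) { this.messageSize = messageSize; }

    long append() {                 // returns the offset of the new message
        segmentBases.addLast(nextOffset);
        long assigned = nextOffset;
        nextOffset += messageSize;  // the "last offset" advances...
        return assigned;
    }

    void cleanAllSegments() {       // ...and the cleaner never rewinds it
        segmentBases.clear();
    }

    long lastOffset() { return nextOffset; }

    public static void main(String[] args) {
        ToyLog log = new ToyLog(100);  // 'a' = 100 bytes per message
        log.append();                  // offset 0; lastOffset() == 100
        log.cleanAllSegments();        // lastOffset() still == 100
        System.out.println(log.append());      // 100, i.e. 'a'
        System.out.println(log.lastOffset());  // 200, i.e. 2 x 'a'
    }
}

Cleaning trims the low end of the readable range; the next offset to
assign never moves backwards.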

Thanks,

Jun

On Fri, Nov 18, 2011 at 7:02 AM, Taylor Gautier <tg...@tagged.com> wrote:

> I don't use high level consumers - just low level.  What I was thinking was
> the following.  Let's assume I have turned off ZK in my setup.
>
> 1) Send 1 message to topic A.  Kafka creates a directory and log segment
> for A.  The log segment starts at 0.   Now, the "last offset" of the topic
> is a.
>
> 2) A consumer reads from topic A the message, and records that the most
> recent offset in topic A is a.
>
> 3) Much time passes, the cleaner runs, and deletes the log segment
>
> 4) More time passes, I restart Kafka.  Kafka sees the topic A directory,
> but has no segment file to initialize from.  So the "last offset" is
> considered to be 0.
>
> 5) Send 1 message to topic A.  Kafka creates a log segment for A starting
> at 0.   The new last offset of the topic is a'.
>
> 6) The consumer from step 2 tries to read from Kafka at offset a, but this
> is now an invalid offset.
>
> Does that sound right?  I haven't tried this yet, I'm just doing a thought
> experiment here to try to figure out what would happen.
>
>
>
>
> On Thu, Nov 17, 2011 at 11:01 PM, Jun Rao <ju...@gmail.com> wrote:
>
> > This is true for the high-level ZK-based consumer.
> >
> > Jun
> >
> > On Thu, Nov 17, 2011 at 10:59 PM, Inder Pall <in...@gmail.com>
> wrote:
> >
> > > Jun & Taylor,
> > > would it be right to say that consumers without ZK won't be a viable
> > option
> > > if you can't handle replay of old messages in your application.
> > >
> > > - inder
> > >
> > > On Fri, Nov 18, 2011 at 12:27 PM, Jun Rao <ju...@gmail.com> wrote:
> > >
> > > > Taylor,
> > > >
> > > > When you start a consumer, it always tries to get the last
> checkpointed
> > > > offset from ZK. If no offset can be found in ZK, the consumer starts
> > from
> > > > either the smallest or the largest available offset in the broker.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Nov 17, 2011 at 9:20 PM, Taylor Gautier <tgautier@tagged.com
> >
> > > > wrote:
> > > >
> > > > > hmmm - and if you turn off zookeeper?
> > > > >
> > > > > On Thu, Nov 17, 2011 at 9:15 PM, Inder Pall <in...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > The consumer offsets are stored in ZooKeeper by topic and
> > partition.
> > > > > > That's how in a consumer fail over scenario you don't get
> duplicate
> > > > > > messages
> > > > > >
> > > > > > - Inder
> > > > > >
> > > > > > On Fri, Nov 18, 2011 at 10:33 AM, Taylor Gautier <
> > > tgautier@tagged.com
> > > > > > >wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > We've noticed that the cleaner script in Kafka removes empty
> log
> > > > > segments
> > > > > > > but not the directories themselves.  I am actually wondering
> > > > something
> > > > > -
> > > > > > I
> > > > > > > always assumed that Kafka could restore the latest offset for
> > > > existing
> > > > > > > topics by scanning the log directory for all directories and
> > > scanning
> > > > > the
> > > > > > > directories for log segment files to restore the latest offset.
> > > > > > >
> > > > > > > Now this conclusion I have made simply by observation - so it
> > could
> > > > be
> > > > > > > entirely wrong.
> > > > > > >
> > > > > > > My question is however - if I am right, and the cleaner removes
> > all
> > > > the
> > > > > > log
> > > > > > > segments for a given topic so that a given topic directory is
> > > empty,
> > > > > how
> > > > > > > does Kafka behave when restarted?  How does it know what the
> next
> > > > > offset
> > > > > > > should be?
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Inder
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Inder
> > >
> >
>

Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
I don't use high-level consumers - just low-level.  What I was thinking
was the following.  Let's assume I have turned off ZK in my setup.

1) Send 1 message to topic A.  Kafka creates a directory and log segment
for A.  The log segment starts at 0.   Now, the "last offset" of the topic
is a.

2) A consumer reads from topic A the message, and records that the most
recent offset in topic A is a.

3) Much time passes, the cleaner runs, and deletes the log segment.

4) More time passes, I restart Kafka.  Kafka sees the topic A directory,
but has no segment file to initialize from.  So the "last offset" is
considered to be 0.

5) Send 1 message to topic A.  Kafka creates a log segment for A starting
at 0.   The new last offset of the topic is a'.

6) The consumer from step 2 tries to read from Kafka at offset a, but this
is now an invalid offset.

Does that sound right?  I haven't tried this yet - I'm just doing a thought
experiment here to figure out what would happen.
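
To make the failure mode concrete, here's a toy simulation of steps 1-6
under my assumption (which may well be wrong - that's the question) that
the broker re-derives its next offset purely from whatever segment files
survive on disk.  All names and sizes are made up for the sketch:

// Sketch of the suspected hazard, not broker code.  ASSUMPTION under
// test: the broker restarts at offset 0 when no segments survive.
public class RestartScenario {
    static final long MSG = 100;   // pretend every message is 100 bytes

    public static void main(String[] args) {
        // 1) + 2): one message lands at offset 0; consumer records a = 100
        long consumerResumeOffset = MSG;

        // 3) + 4): cleaner deletes the only segment; broker restarts and,
        // under the assumption, starts assigning from 0 again
        long brokerNextOffset = 0;

        // 5): one new message is appended at offset 0; log spans [0, 100)
        brokerNextOffset += MSG;

        // 6): the consumer resumes at offset 100, past the new log end;
        // worse, once a second post-restart message arrives at offset
        // 100, the same fetch silently returns an unrelated message
        System.out.println(consumerResumeOffset >= brokerNextOffset
            ? "fetch at " + consumerResumeOffset + " is past the new log end"
            : "fetch at " + consumerResumeOffset + " reads the wrong message");
    }
}

Either way the consumer's saved offset silently stops meaning what it
used to.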




On Thu, Nov 17, 2011 at 11:01 PM, Jun Rao <ju...@gmail.com> wrote:

> This is true for the high-level ZK-based consumer.
>
> Jun
>
> On Thu, Nov 17, 2011 at 10:59 PM, Inder Pall <in...@gmail.com> wrote:
>
> > Jun & Taylor,
> > would it be right to say that consumers without ZK won't be a viable
> option
> > if you can't handle replay of old messages in your application.
> >
> > - inder
> >
> > On Fri, Nov 18, 2011 at 12:27 PM, Jun Rao <ju...@gmail.com> wrote:
> >
> > > Taylor,
> > >
> > > When you start a consumer, it always tries to get the last checkpointed
> > > offset from ZK. If no offset can be found in ZK, the consumer starts
> from
> > > either the smallest or the largest available offset in the broker.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Nov 17, 2011 at 9:20 PM, Taylor Gautier <tg...@tagged.com>
> > > wrote:
> > >
> > > > hmmm - and if you turn off zookeeper?
> > > >
> > > > On Thu, Nov 17, 2011 at 9:15 PM, Inder Pall <in...@gmail.com>
> > > wrote:
> > > >
> > > > > The consumer offsets are stored in ZooKeeper by topic and
> partition.
> > > > > That's how in a consumer fail over scenario you don't get duplicate
> > > > > messages
> > > > >
> > > > > - Inder
> > > > >
> > > > > On Fri, Nov 18, 2011 at 10:33 AM, Taylor Gautier <
> > tgautier@tagged.com
> > > > > >wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > We've noticed that the cleaner script in Kafka removes empty log
> > > > segments
> > > > > > but not the directories themselves.  I am actually wondering
> > > something
> > > > -
> > > > > I
> > > > > > always assumed that Kafka could restore the latest offset for
> > > existing
> > > > > > topics by scanning the log directory for all directories and
> > scanning
> > > > the
> > > > > > directories for log segment files to restore the latest offset.
> > > > > >
> > > > > > Now this conclusion I have made simply by observation - so it
> could
> > > be
> > > > > > entirely wrong.
> > > > > >
> > > > > > My question is however - if I am right, and the cleaner removes
> all
> > > the
> > > > > log
> > > > > > segments for a given topic so that a given topic directory is
> > empty,
> > > > how
> > > > > > does Kafka behave when restarted?  How does it know what the next
> > > > offset
> > > > > > should be?
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Inder
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > -- Inder
> >
>

Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
This is true for the high-level ZK-based consumer.

Jun

On Thu, Nov 17, 2011 at 10:59 PM, Inder Pall <in...@gmail.com> wrote:

> Jun & Taylor,
> would it be right to say that consumers without ZK won't be a viable option
> if you can't handle replay of old messages in your application.
>
> - inder
>
> On Fri, Nov 18, 2011 at 12:27 PM, Jun Rao <ju...@gmail.com> wrote:
>
> > Taylor,
> >
> > When you start a consumer, it always tries to get the last checkpointed
> > offset from ZK. If no offset can be found in ZK, the consumer starts from
> > either the smallest or the largest available offset in the broker.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Nov 17, 2011 at 9:20 PM, Taylor Gautier <tg...@tagged.com>
> > wrote:
> >
> > > hmmm - and if you turn off zookeeper?
> > >
> > > On Thu, Nov 17, 2011 at 9:15 PM, Inder Pall <in...@gmail.com>
> > wrote:
> > >
> > > > The consumer offsets are stored in ZooKeeper by topic and partition.
> > > > That's how in a consumer fail over scenario you don't get duplicate
> > > > messages
> > > >
> > > > - Inder
> > > >
> > > > On Fri, Nov 18, 2011 at 10:33 AM, Taylor Gautier <
> tgautier@tagged.com
> > > > >wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We've noticed that the cleaner script in Kafka removes empty log
> > > segments
> > > > > but not the directories themselves.  I am actually wondering
> > something
> > > -
> > > > I
> > > > > always assumed that Kafka could restore the latest offset for
> > existing
> > > > > topics by scanning the log directory for all directories and
> scanning
> > > the
> > > > > directories for log segment files to restore the latest offset.
> > > > >
> > > > > Now this conclusion I have made simply by observation - so it could
> > be
> > > > > entirely wrong.
> > > > >
> > > > > My question is however - if I am right, and the cleaner removes all
> > the
> > > > log
> > > > > segments for a given topic so that a given topic directory is
> empty,
> > > how
> > > > > does Kafka behave when restarted?  How does it know what the next
> > > offset
> > > > > should be?
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Inder
> > > >
> > >
> >
>
>
>
> --
> -- Inder
>

Re: the cleaner and log segments

Posted by Inder Pall <in...@gmail.com>.
Jun & Taylor,
Would it be right to say that consumers without ZK won't be a viable option
if you can't handle replay of old messages in your application?

- inder

On Fri, Nov 18, 2011 at 12:27 PM, Jun Rao <ju...@gmail.com> wrote:

> Taylor,
>
> When you start a consumer, it always tries to get the last checkpointed
> offset from ZK. If no offset can be found in ZK, the consumer starts from
> either the smallest or the largest available offset in the broker.
>
> Thanks,
>
> Jun
>
> On Thu, Nov 17, 2011 at 9:20 PM, Taylor Gautier <tg...@tagged.com>
> wrote:
>
> > hmmm - and if you turn off zookeeper?
> >
> > On Thu, Nov 17, 2011 at 9:15 PM, Inder Pall <in...@gmail.com>
> wrote:
> >
> > > The consumer offsets are stored in ZooKeeper by topic and partition.
> > > That's how in a consumer fail over scenario you don't get duplicate
> > > messages
> > >
> > > - Inder
> > >
> > > On Fri, Nov 18, 2011 at 10:33 AM, Taylor Gautier <tgautier@tagged.com
> > > >wrote:
> > >
> > > > Hi,
> > > >
> > > > We've noticed that the cleaner script in Kafka removes empty log
> > segments
> > > > but not the directories themselves.  I am actually wondering
> something
> > -
> > > I
> > > > always assumed that Kafka could restore the latest offset for
> existing
> > > > topics by scanning the log directory for all directories and scanning
> > the
> > > > directories for log segment files to restore the latest offset.
> > > >
> > > > Now this conclusion I have made simply by observation - so it could
> be
> > > > entirely wrong.
> > > >
> > > > My question is however - if I am right, and the cleaner removes all
> the
> > > log
> > > > segments for a given topic so that a given topic directory is empty,
> > how
> > > > does Kafka behave when restarted?  How does it know what the next
> > offset
> > > > should be?
> > > >
> > >
> > >
> > >
> > > --
> > > -- Inder
> > >
> >
>



-- 
-- Inder

Re: the cleaner and log segments

Posted by Jun Rao <ju...@gmail.com>.
Taylor,

When you start a consumer, it always tries to get the last checkpointed
offset from ZK. If no offset can be found in ZK, the consumer starts from
either the smallest or the largest available offset in the broker.
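
In rough pseudocode (a sketch of the decision only, not the actual
consumer source; "smallest"/"largest" mirror the autooffset.reset
config property, if I'm remembering its name right):

// Sketch of the consumer's starting-offset decision at startup.
public class StartingOffset {
    static long choose(Long zkCheckpoint, long smallest, long largest,
                       String autoOffsetReset) {
        if (zkCheckpoint != null) {
            return zkCheckpoint;   // resume from the ZK checkpoint
        }
        // no checkpoint in ZK: fall back to one end of the broker's log
        return "smallest".equals(autoOffsetReset)
            ? smallest             // replay whatever is still retained
            : largest;             // skip history and tail new data
    }

    public static void main(String[] args) {
        // no checkpoint, reset to "largest" -> start at the log end
        System.out.println(choose(null, 0L, 2048L, "largest"));  // 2048
    }
}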

Thanks,

Jun

On Thu, Nov 17, 2011 at 9:20 PM, Taylor Gautier <tg...@tagged.com> wrote:

> hmmm - and if you turn off zookeeper?
>
> On Thu, Nov 17, 2011 at 9:15 PM, Inder Pall <in...@gmail.com> wrote:
>
> > The consumer offsets are stored in ZooKeeper by topic and partition.
> > That's how in a consumer fail over scenario you don't get duplicate
> > messages
> >
> > - Inder
> >
> > On Fri, Nov 18, 2011 at 10:33 AM, Taylor Gautier <tgautier@tagged.com
> > >wrote:
> >
> > > Hi,
> > >
> > > We've noticed that the cleaner script in Kafka removes empty log
> segments
> > > but not the directories themselves.  I am actually wondering something
> -
> > I
> > > always assumed that Kafka could restore the latest offset for existing
> > > topics by scanning the log directory for all directories and scanning
> the
> > > directories for log segment files to restore the latest offset.
> > >
> > > Now this conclusion I have made simply by observation - so it could be
> > > entirely wrong.
> > >
> > > My question is however - if I am right, and the cleaner removes all the
> > log
> > > segments for a given topic so that a given topic directory is empty,
> how
> > > does Kafka behave when restarted?  How does it know what the next
> offset
> > > should be?
> > >
> >
> >
> >
> > --
> > -- Inder
> >
>

Re: the cleaner and log segments

Posted by Taylor Gautier <tg...@tagged.com>.
Hmmm - and what if you turn off ZooKeeper?

On Thu, Nov 17, 2011 at 9:15 PM, Inder Pall <in...@gmail.com> wrote:

> The consumer offsets are stored in ZooKeeper by topic and partition.
> That's how in a consumer fail over scenario you don't get duplicate
> messages
>
> - Inder
>
> On Fri, Nov 18, 2011 at 10:33 AM, Taylor Gautier <tgautier@tagged.com
> >wrote:
>
> > Hi,
> >
> > We've noticed that the cleaner script in Kafka removes empty log segments
> > but not the directories themselves.  I am actually wondering something -
> I
> > always assumed that Kafka could restore the latest offset for existing
> > topics by scanning the log directory for all directories and scanning the
> > directories for log segment files to restore the latest offset.
> >
> > Now this conclusion I have made simply by observation - so it could be
> > entirely wrong.
> >
> > My question is however - if I am right, and the cleaner removes all the
> log
> > segments for a given topic so that a given topic directory is empty, how
> > does Kafka behave when restarted?  How does it know what the next offset
> > should be?
> >
>
>
>
> --
> -- Inder
>

Re: the cleaner and log segments

Posted by Inder Pall <in...@gmail.com>.
The consumer offsets are stored in ZooKeeper by topic and partition.
That's how, in a consumer failover scenario, you don't get duplicate
messages.
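
Concretely, each group's offsets live under a per-group path, something
like this (from memory, so treat the exact layout as approximate):

  /consumers/<group>/offsets/<topic>/<broker_id>-<partition>  ->  <offset>

e.g. a znode /consumers/mygroup/offsets/A/0-0 holding the value 1024
would mean group "mygroup" has consumed topic A, partition 0 on broker
0, up to offset 1024.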

- Inder

On Fri, Nov 18, 2011 at 10:33 AM, Taylor Gautier <tg...@tagged.com>wrote:

> Hi,
>
> We've noticed that the cleaner script in Kafka removes empty log segments
> but not the directories themselves.  I am actually wondering something - I
> always assumed that Kafka could restore the latest offset for existing
> topics by scanning the log directory for all directories and scanning the
> directories for log segment files to restore the latest offset.
>
> Now this conclusion I have made simply by observation - so it could be
> entirely wrong.
>
> My question is however - if I am right, and the cleaner removes all the log
> segments for a given topic so that a given topic directory is empty, how
> does Kafka behave when restarted?  How does it know what the next offset
> should be?
>



-- 
-- Inder