Posted to users@kafka.apache.org by Han JU <ju...@gmail.com> on 2016/02/01 18:59:39 UTC

Message lost after consumer crash in kafka 0.9

Hi,

One of our uses of Kafka is to tolerate arbitrary consumer crashes without
losing or duplicating messages. So in our code we manually commit offsets
after having successfully persisted the consumer state.
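
For reference, the pattern looks roughly like this (a minimal sketch, not
our production code; `process` and `persistState` stand in for the real
steps, and the group id and topic name are illustrative):

  object PersistThenCommit extends App {
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
    import scala.collection.JavaConverters._

    def process(r: ConsumerRecord[String, String]): Unit = ()  // placeholder
    def persistState(): Unit = ()                              // placeholder

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "state-persisting-group")  // illustrative group id
    props.put("enable.auto.commit", "false")         // offsets committed by hand
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("my-topic"))  // illustrative topic

    while (true) {
      val records = consumer.poll(1000L)
      records.asScala.foreach(process)
      persistState()         // first persist the consumer state...
      consumer.commitSync()  // ...then commit the offsets
    }
  }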

While prototyping with Kafka 0.9's new consumer API, I found that in some
cases Kafka fails to deliver a part of the messages to the consumers even
though the offsets are handled correctly.

I've made sure that this time everything is at the latest on the 0.9.0
branch (d1ff6c7) for both broker and client code.

Test code snippet is here:
  https://gist.github.com/darkjh/437ac72cdd4b1c4ca2e7

Test setup:
  - 12 partitions
  - 2 consumer app processes with 2 consumer threads each
  - the producer produces exactly 1.2M messages in about 2 minutes (enough
time for us to manually kill -9 a consumer)
  - each consumer thread commits offsets after every 80k messages received
(to simulate our regular offset commits); see the sketch after this list
  - after all messages are consumed, each consumer thread writes to a file
the number of messages it has received, so the numbers should sum to
exactly 1.2M if everything goes well
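
In outline, each consumer thread runs a loop like this (a simplified
sketch with illustrative names, not the exact gist code; the termination
condition is illustrative too):

  import java.io.PrintWriter
  import org.apache.kafka.clients.consumer.KafkaConsumer

  // `consumer` is a KafkaConsumer[String, String] with
  // enable.auto.commit=false, already subscribed to the test topic
  def consumeLoop(consumer: KafkaConsumer[String, String], outFile: String): Unit = {
    var received = 0L     // total messages seen by this thread
    var sinceCommit = 0L  // messages since the last offset commit
    var idleRounds = 0    // illustrative: stop after ~10s without messages
    try {
      while (idleRounds < 10) {
        val n = consumer.poll(1000L).count()
        if (n == 0) idleRounds += 1 else idleRounds = 0
        received += n
        sinceCommit += n
        if (sinceCommit >= 80000) {  // commit every ~80k messages
          consumer.commitSync()
          sinceCommit = 0
        }
      }
    } finally {
      consumer.commitSync()  // commit once more on the way out
      val out = new PrintWriter(outFile)
      out.println(received)  // final count, for the 1.2M sum check
      out.close()
    }
  }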

Test run:
  - run the producer
  - run the 2 consumer app processes at the same time
  - wait for the first offset commit (first 80k messages received in each
consumer thread)
  - after the first offset commit, kill -9 one of the consumer apps
  - let the other consumer app run until all messages are consumed
  - check the files written by the remaining consumer threads

After that, checking the files, we did not receive 1.2M messages but only
roughly 1.04M. The consumer lag on Kafka for this topic is 0.
If you check the logs of the consumer app at DEBUG level, you'll find
that the offsets are correctly handled. 30s (the default timeout) after the
kill -9 of one consumer app, the remaining consumer app correctly gets
assigned all the partitions and starts right from the offsets that the
crashed consumer had previously committed. So this message loss is quite
mysterious to us.
Note that the moment of the kill -9 is important. If we kill -9 one
consumer app *before* the first offset commit, everything goes well: all
messages are received, none lost. But when it is killed *after* the first
offset commit, messages are lost.

Hope the code is clear enough to reproduce the problem. I'm available for
any further details needed.

Thanks!
-- 
*JU Han*

Software Engineer @ Teads.tv

+33 0619608888

Re: Message lost after consumer crash in kafka 0.9

Posted by Guozhang Wang <wa...@gmail.com>.
It is indeed weird that if you kill -9 before the first commit, there is
no data loss.

But with what I suspect, you can get data loss in the middle, not only in
the last messages: once consumer1 is killed, consumer2 will take over the
partitions assigned to consumer1 and resume from the committed offsets.
In my example it will start from message 100, while consumer1 had only
processed messages up to 50.

Guozhang


Re: Message lost after consumer crash in kafka 0.9

Posted by Han JU <ju...@gmail.com>.
Sorry, in fact the test code in the gist does not exactly reproduce the
problem we're facing. I'm working on that.

-- 
*JU Han*

Software Engineer @ Teads.tv

+33 0619608888

Re: Message lost after consumer crash in kafka 0.9

Posted by Han JU <ju...@gmail.com>.
Thanks Guozhang for the reply!

So if what you describe is the case, then if I understand correctly, the
lost messages should be the last messages. But in our use case it is not
the last messages that get lost. And it does not explain the different
behavior depending on the moment of the `kill -9` (before a commit or
after a commit): if a consumer app is killed before the first flush/commit,
every message is received correctly.

About the lost messages: our real app code flushes state and commits
offsets regularly (say, every 15 minutes). In my test, say I have 45
minutes of data, so there will be 2 flush/commit points and 3 chunks of
flushed data. If a consumer app process is `kill -9`ed after the first
flush/commit point and I let the remaining app run till the end, I get
message loss only in the second chunk. Both the first and the third chunks
are handled perfectly.

-- 
*JU Han*

Software Engineer @ Teads.tv

+33 0619608888

Re: Message lost after consumer crash in kafka 0.9

Posted by Guozhang Wang <wa...@gmail.com>.
One thing to add is that by doing this you could possibly get duplicates
but not data loss, which obeys Kafka's at-least-once semantics.
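
To make the timing concrete, here is a minimal sketch (with `handle` as a
stand-in for the application's processing):

  import org.apache.kafka.clients.consumer.KafkaConsumer
  import scala.collection.JavaConverters._

  def handle(value: String): Unit = ()  // stand-in for real processing

  def pollOnce(consumer: KafkaConsumer[String, String]): Unit = {
    val records = consumer.poll(1000L)
    records.asScala.foreach(r => handle(r.value()))  // (1) process the whole batch
    consumer.commitSync()                            // (2) only then commit
    // A crash between (1) and (2) re-delivers the batch after the
    // rebalance: duplicates are possible, but nothing is lost.
  }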

Guozhang


Re: Message lost after consumer crash in kafka 0.9

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Han,

I looked at your test code and the error is actually in this line:
https://gist.github.com/darkjh/437ac72cdd4b1c4ca2e7#file-kafkabug2-scala-L61

where you call "commitSync" in the finally block, which will commit the
offsets of the messages that were returned to you by the poll() call.


More specifically, suppose your poll() call returned a set of messages
with offsets 0 to 100. From the consumer's point of view, once they are
returned to the user they are considered "consumed", and hence if you call
commitSync after that they will ALL be committed (i.e. the consumer will
commit offset 100). But if you hit an exception or got a close signal
while, say, processing the message with offset 50, and then call
commitSync in the finally block, you will effectively lose messages 50
to 100.

Hence as a user of the consumer, one should only call "commit" if she is
certain that all messages returned from "poll()" have been processed.
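
Concretely, one way to do this is to track the offset of the last
processed message per partition and commit exactly those offsets, even in
the shutdown path. A sketch (the names are illustrative, not from the
gist):

  import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer, OffsetAndMetadata}
  import org.apache.kafka.common.TopicPartition
  import scala.collection.JavaConverters._
  import scala.collection.mutable

  val lastProcessed = mutable.Map.empty[TopicPartition, Long]

  // call this only after a record has really been processed
  def markProcessed(r: ConsumerRecord[String, String]): Unit =
    lastProcessed(new TopicPartition(r.topic(), r.partition())) = r.offset()

  // safe to call from a finally block: commits only what was processed
  def commitProcessed(consumer: KafkaConsumer[String, String]): Unit = {
    val toCommit = lastProcessed.map { case (tp, off) =>
      // the committed offset is the next message to read, hence off + 1
      tp -> new OffsetAndMetadata(off + 1)
    }.asJava
    consumer.commitSync(toCommit)
  }

With this, a crash in the middle of a batch makes the new owner of the
partition re-read the unprocessed tail instead of skipping it.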

Guozhang

