Posted to users@kafka.apache.org by Dmitriy Vsekhvalnov <dv...@gmail.com> on 2017/10/06 14:52:13 UTC

kafka broker losing offsets?

Hi all,

we have several times faced a situation where a consumer group started to
re-consume old events from the beginning. Here is the scenario:

1. 3-broker Kafka cluster on top of a 3-node ZooKeeper ensemble
2. RF=3 for all topics
3. log.retention.hours=168 and offsets.retention.minutes=20160
4. running sustained load (continuously pushing events)
5. doing disaster testing by randomly shutting down 1 of the 3 broker nodes
(then provisioning a new broker back)

Several times after bouncing a broker we faced a situation where the consumer
group started to re-consume old events.

Consumer group:

1. enable.auto.commit = false
2. tried graceful group shutdown, kill -9, and terminating AWS nodes
3. never experienced re-consumption in any of those shutdown cases

What can cause this re-consumption of old events? Is it related to bouncing
one of the brokers? What should we search for in the logs? Any broker settings to try?

Thanks in advance.
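
For reference, a minimal sketch of the broker and consumer settings described
above, with the unit conversion spelled out (everything not quoted in the
scenario is illustrative only):

    # broker (server.properties) -- values as described in the scenario:
    # log data is kept for 168 h = 7 days,
    # committed offsets are kept for 20160 min = 336 h = 14 days
    log.retention.hours=168
    offsets.retention.minutes=20160

    # consumer group described above: offsets are committed explicitly
    # by the application rather than auto-committed
    enable.auto.commit=false

So in this setup committed offsets are retained twice as long as the log data
itself (336 h vs 168 h).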

Re: kafka broker losing offsets?

Posted by Vincent Dautremont <vi...@olamobile.com.INVALID>.
Hi,
I have the same setup as Dmitriy and have experienced exactly the same issue
twice in the last month (the only difference from Dmitriy's setup is that I
have librdkafka 0.9.5 clients).

It is as if the __consumer_offsets partitions were not synced but were still
reported as in-sync (and so the syncing would never be restarted or continued).
I've never experienced this on Kafka 0.9.x or 0.10.x clusters.
The 0.11.0.0 cluster where it happened has been upgraded to 0.11.0.1 in the
hope that it fixes this.
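
One way to check what each replica actually holds (rather than what the
partition leader serves) is to dump the __consumer_offsets segment files
directly on every broker that hosts a replica. A rough sketch, assuming a
0.11 broker; the log directory and partition number below are placeholders:

    # run on each broker holding a replica of the offsets partition in question
    # (log dir path and partition number are placeholders)
    bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
      --offsets-decoder \
      --files /var/lib/kafka/data/__consumer_offsets-37/00000000000000000000.log

Comparing the last committed offset for the group across the leader and the
followers should show whether replicas really diverge while still being
reported as in-sync.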

On Fri, Oct 6, 2017 at 5:35 PM, Manikumar <ma...@gmail.com> wrote:

> normally, log.retention.hours (168hrs)  should be higher than
> offsets.retention.minutes (336 hrs)?
>
>
> On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com>
> wrote:
>
> > Hi Ted,
> >
> > Broker: v0.11.0.0
> >
> > Consumer:
> > kafka-clients v0.11.0.0
> > auto.offset.reset = earliest
> >
> >
> >
> > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > What's the value for auto.offset.reset  ?
> > >
> > > Which release are you using ?
> > >
> > > Cheers
> > >
> > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > dvsekhvalnov@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > we several time faced situation where consumer-group started to
> > > re-consume
> > > > old events from beginning. Here is scenario:
> > > >
> > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > 2. RF=3 for all topics
> > > > 3. log.retention.hours=168 and offsets.retention.minutes=20160
> > > > 4. running sustainable load (pushing events)
> > > > 5. doing disaster testing by randomly shutting down 1 of 3 broker
> nodes
> > > > (then provision new broker back)
> > > >
> > > > Several times after bouncing broker we faced situation where consumer
> > > group
> > > > started to re-consume old events.
> > > >
> > > > consumer group:
> > > >
> > > > 1. enable.auto.commit = false
> > > > 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> > > > 3. never experienced re-consumption for given cases.
> > > >
> > > > What can cause that old events re-consumption? Is it related to
> > bouncing
> > > > one of brokers? What to search in a logs? Any broker settings to try?
> > > >
> > > > Thanks in advance.
> > > >
> > >
> >
>


Re: kafka broker losing offsets?

Posted by Michal Michalski <mi...@zalando.ie>.
Hi Dmitriy,

I didn't follow the whole thread, but if it's not an issue with Kafka
0.11.0.0 (there was another thread about that recently), make sure the
replication factor of the offsets topic is 3 (you mentioned "RF=3 for all
topics", but I wasn't sure whether that includes the offsets one).

There was a bug in older Kafka versions [1] that should already be fixed in
0.11, but *if* your offsets topic was created earlier (e.g. you were running
an older Kafka version that you only recently upgraded), it might not be
replicated as you'd expect (RF=1 rather than 3), *and* if you're using
ephemeral storage (e.g. AWS EC2 instance storage), restarting a node would
wipe out the offset data you're looking for, so you'd always start from
scratch. It sounds like an unlikely scenario that requires some very specific
preconditions and a bit of bad luck, but if you've checked everything else
and run out of other ideas, maybe it's worth checking this possibility as
well :-)

[1] https://issues.apache.org/jira/browse/KAFKA-3959
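
For example, the replication factor that __consumer_offsets was actually
created with can be checked with the standard topic tool (the ZooKeeper
address below is a placeholder):

    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic __consumer_offsets

If it reports ReplicationFactor:1, the topic was created with fewer replicas
than intended (the situation described in [1]) and would have to be expanded
with a partition reassignment.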

Michal


On 11 October 2017 at 16:44, Dmitriy Vsekhvalnov <dv...@gmail.com>
wrote:

> Hey, want to resurrect this thread.
>
> Decided to do idle test, where no load data is produced to topic at all.
> And when we kill #101 or #102 - nothing happening. But when we kill #200 -
> consumers starts to re-consume old events from random position.
>
> Anybody have ideas what to check?  I really expected that Kafka will fail
> symmetrical with respect to any broker.
>
> On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com>
> wrote:
>
> > Hi tao,
> >
> > we had unclean leader election enabled at the beginning. But then
> disabled
> > it and also reduced 'max.poll.records' value.  It helped little bit.
> >
> > But after today's testing there is strong correlation between lag spike
> > and what broker we crash. For lowest ID (100) broker :
> >   1. it always at least 1-2 orders higher lag
> >   2. we start getting
> >
> > org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
> be
> > completed since the group has already rebalanced and assigned the
> > partitions to another member. This means that the time between subsequent
> > calls to poll() was longer than the configured max.poll.interval.ms,
> > which typically implies that the poll loop is spending too much time
> > message processing. You can address this either by increasing the session
> > timeout or by reducing the maximum size of batches returned in poll()
> with
> > max.poll.records.
> >
> >   3. sometime re-consumption from random position
> >
> > And when we crashing other brokers (101, 102), it just lag spike of ~10Ks
> > order, settle down quite quickly, no consumer exceptions.
> >
> > Totally lost what to try next.
> >
> > On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xi...@gmail.com> wrote:
> >
> >> Do you have unclean leader election turned on? If killing 100 is the
> only
> >> way to reproduce the problem, it is possible with unclean leader
> election
> >> turned on that leadership was transferred to out of ISR follower which
> may
> >> not have the latest high watermark
> >> On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov <
> >> dvsekhvalnov@gmail.com>
> >> wrote:
> >>
> >> > About to verify hypothesis on monday, but looks like that in latest
> >> tests.
> >> > Need to double check.
> >> >
> >> > On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com>
> >> wrote:
> >> >
> >> > > So no matter in what sequence you shutdown brokers it is only 1 that
> >> > causes
> >> > > the major problem? That would indeed be a bit weird. have you
> checked
> >> > > offsets of your consumer - right after offsets jump back - does it
> >> start
> >> > > from the topic start or does it go back to some random position?
> Have
> >> you
> >> > > checked if all offsets are actually being committed by consumers?
> >> > >
> >> > > fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
> >> > > dvsekhvalnov@gmail.com
> >> > > >:
> >> > >
> >> > > > Yeah, probably we can dig around.
> >> > > >
> >> > > > One more observation, the most lag/re-consumption trouble
> happening
> >> > when
> >> > > we
> >> > > > kill broker with lowest id (e.g. 100 from [100,101,102]).
> >> > > > When crashing other brokers - there is nothing special happening,
> >> lag
> >> > > > growing little bit but nothing crazy (e.g. thousands, not
> millions).
> >> > > >
> >> > > > Is it sounds suspicious?
> >> > > >
> >> > > > On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <sc...@gmail.com>
> >> > wrote:
> >> > > >
> >> > > > > Ted: when choosing earliest/latest you are saying: if it happens
> >> that
> >> > > > there
> >> > > > > is no "valid" offset committed for a consumer (for whatever
> >> reason:
> >> > > > > bug/misconfiguration/no luck) it will be ok to start from the
> >> > beginning
> >> > > > or
> >> > > > > end of the topic. So if you are not ok with that you should
> choose
> >> > > none.
> >> > > > >
> >> > > > > Dmitriy: Ok. Then it is spring-kafka that maintains this offset
> >> per
> >> > > > > partition state for you. it might also has that problem of
> leaving
> >> > > stale
> >> > > > > offsets lying around, After quickly looking through
> >> > > > > https://github.com/spring-projects/spring-kafka/blob/
> >> > > > > 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> >> > > > > main/java/org/springframework/kafka/listener/
> >> > > > > KafkaMessageListenerContainer.java
> >> > > > > it looks possible since offsets map is not cleared upon
> partition
> >> > > > > revocation, but that is just a hypothesis. I have no experience
> >> with
> >> > > > > spring-kafka. However since you say you consumers were always
> >> active
> >> > I
> >> > > > find
> >> > > > > this theory worth investigating.
> >> > > > >
> >> > > > >
> >> > > > > 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> >> > > > > vincent.dautremont@olamobile.com.invalid>:
> >> > > > >
> >> > > > > > is there a way to read messages on a topic partition from a
> >> > specific
> >> > > > node
> >> > > > > > we that we choose (and not by the topic partition leader) ?
> >> > > > > > I would like to read myself that each of the
> __consumer_offsets
> >> > > > partition
> >> > > > > > replicas have the same consumer group offset written in it in
> >> it.
> >> > > > > >
> >> > > > > > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> >> > > > > > dvsekhvalnov@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Stas:
> >> > > > > > >
> >> > > > > > > we rely on spring-kafka, it  commits offsets "manually" for
> us
> >> > > after
> >> > > > > > event
> >> > > > > > > handler completed. So it's kind of automatic once there is
> >> > constant
> >> > > > > > stream
> >> > > > > > > of events (no idle time, which is true for us). Though it's
> >> not
> >> > > what
> >> > > > > pure
> >> > > > > > > kafka-client calls "automatic" (flush commits at fixed
> >> > intervals).
> >> > > > > > >
> >> > > > > > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <
> >> schizhov@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > You don't have autocmmit enables that means you commit
> >> offsets
> >> > > > > > yourself -
> >> > > > > > > > correct? If you store them per partition somewhere and
> fail
> >> to
> >> > > > clean
> >> > > > > it
> >> > > > > > > up
> >> > > > > > > > upon rebalance next time the consumer gets this partition
> >> > > assigned
> >> > > > > > during
> >> > > > > > > > next rebalance it can commit old stale offset- can this be
> >> the
> >> > > > case?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> >> > > > > > > > dvsekhvalnov@gmail.com
> >> > > > > > > > >:
> >> > > > > > > >
> >> > > > > > > > > Reprocessing same events again - is fine for us
> >> (idempotent).
> >> > > > While
> >> > > > > > > > loosing
> >> > > > > > > > > data is more critical.
> >> > > > > > > > >
> >> > > > > > > > > What are reasons of such behaviour? Consumers are never
> >> idle,
> >> > > > > always
> >> > > > > > > > > commiting, probably something wrong with broker setup
> >> then?
> >> > > > > > > > >
> >> > > > > > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <
> >> yuzhihong@gmail.com>
> >> > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Stas:
> >> > > > > > > > > > bq.  using anything but none is not really an option
> >> > > > > > > > > >
> >> > > > > > > > > > If you have time, can you explain a bit more ?
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks
> >> > > > > > > > > >
> >> > > > > > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
> >> > > > schizhov@gmail.com
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > If you set auto.offset.reset to none next time it
> >> happens
> >> > > you
> >> > > > > > will
> >> > > > > > > be
> >> > > > > > > > > in
> >> > > > > > > > > > > much better position to find out what happens. Also
> in
> >> > > > general
> >> > > > > > with
> >> > > > > > > > > > current
> >> > > > > > > > > > > semantics of offset reset policy IMO using anything
> >> but
> >> > > none
> >> > > > is
> >> > > > > > not
> >> > > > > > > > > > really
> >> > > > > > > > > > > an option unless it is ok for consumer to loose some
> >> data
> >> > > > > > (latest)
> >> > > > > > > or
> >> > > > > > > > > > > reprocess it second time (earliest).
> >> > > > > > > > > > >
> >> > > > > > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
> >> > > yuzhihong@gmail.com
> >> > > > >:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Should Kafka log warning if log.retention.hours is
> >> > lower
> >> > > > than
> >> > > > > > > > number
> >> > > > > > > > > of
> >> > > > > > > > > > > > hours specified by offsets.retention.minutes ?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> >> > > > > > > > manikumar.reddy@gmail.com
> >> > > > > > > > > >
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > normally, log.retention.hours (168hrs)  should
> be
> >> > > higher
> >> > > > > than
> >> > > > > > > > > > > > > offsets.retention.minutes (336 hrs)?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy
> >> Vsekhvalnov <
> >> > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> >> > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Ted,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Broker: v0.11.0.0
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Consumer:
> >> > > > > > > > > > > > > > kafka-clients v0.11.0.0
> >> > > > > > > > > > > > > > auto.offset.reset = earliest
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> >> > > > > > yuzhihong@gmail.com>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > What's the value for auto.offset.reset  ?
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Which release are you using ?
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Cheers
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
> >> > > Vsekhvalnov <
> >> > > > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> >> > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hi all,
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > we several time faced situation where
> >> > > > consumer-group
> >> > > > > > > > started
> >> > > > > > > > > to
> >> > > > > > > > > > > > > > > re-consume
> >> > > > > > > > > > > > > > > > old events from beginning. Here is
> scenario:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3
> node
> >> > > > > zookeeper
> >> > > > > > > > > > > > > > > > 2. RF=3 for all topics
> >> > > > > > > > > > > > > > > > 3. log.retention.hours=168 and
> >> > > > > > > > > offsets.retention.minutes=20160
> >> > > > > > > > > > > > > > > > 4. running sustainable load (pushing
> events)
> >> > > > > > > > > > > > > > > > 5. doing disaster testing by randomly
> >> shutting
> >> > > > down 1
> >> > > > > > of
> >> > > > > > > 3
> >> > > > > > > > > > broker
> >> > > > > > > > > > > > > nodes
> >> > > > > > > > > > > > > > > > (then provision new broker back)
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Several times after bouncing broker we
> faced
> >> > > > > situation
> >> > > > > > > > where
> >> > > > > > > > > > > > consumer
> >> > > > > > > > > > > > > > > group
> >> > > > > > > > > > > > > > > > started to re-consume old events.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > consumer group:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 1. enable.auto.commit = false
> >> > > > > > > > > > > > > > > > 2. tried graceful group shutdown, kill -9
> >> and
> >> > > > > > terminating
> >> > > > > > > > AWS
> >> > > > > > > > > > > nodes
> >> > > > > > > > > > > > > > > > 3. never experienced re-consumption for
> >> given
> >> > > > cases.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > What can cause that old events
> >> re-consumption?
> >> > Is
> >> > > > it
> >> > > > > > > > related
> >> > > > > > > > > to
> >> > > > > > > > > > > > > > bouncing
> >> > > > > > > > > > > > > > > > one of brokers? What to search in a logs?
> >> Any
> >> > > > broker
> >> > > > > > > > settings
> >> > > > > > > > > > to
> >> > > > > > > > > > > > try?
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Thanks in advance.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: kafka broker losing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
Hey guys,

just want to post that upgrading to 0.11.0.1 solved the issue. After
extensive disaster testing, no re-consumption of old offsets was
experienced.
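
For anyone running similar disaster tests, a simple way to watch whether
committed offsets survive a broker bounce is to describe the group before and
after killing a broker (the bootstrap server and group name below are
placeholders):

    bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
      --describe --group my-consumer-group

A sudden drop in CURRENT-OFFSET (or a jump in LAG) for the group's partitions
right after the bounce is the symptom discussed in this thread.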



On Thu, Oct 12, 2017 at 1:35 AM, Vincent Dautremont <
vincent.dautremont@olamobile.com.invalid> wrote:

> Hi,
> We have 4 differents Kafka cluster running,
> 2 on 0.10.1.0
> 1 on 0.10.0.1
> 1 that was on 0.11.0.0 and last week updated to 0.11.0.1
>
> I’ve only seen the issue happen 2 times in production usage on the cluster
> on 0.11.0.0 since it’s running (about 3months).
>
> But I’ll monitor and report it here if it ever happen again in the future.
> We’ll also upgrade all our clusters to 0.11.0.1 in the next days.
>
> 🤞🏻!
>
> > Le 11 oct. 2017 à 17:47, Dmitriy Vsekhvalnov <dv...@gmail.com> a
> écrit :
> >
> > Yeah just pops up in my list. Thanks, i'll take a look.
> >
> > Vincent Dautremont, if you still reading it, did you try upgrade to
> > 0.11.0.1? Fixed issue?
> >
> > On Wed, Oct 11, 2017 at 6:46 PM, Ben Davison <be...@7digital.com>
> > wrote:
> >
> >> Hi Dmitriy,
> >>
> >> Did you check out this thread "Incorrect consumer offsets after broker
> >> restart 0.11.0.0" from Phil Luckhurst, it sounds similar.
> >>
> >> Thanks,
> >>
> >> Ben
> >>
> >> On Wed, Oct 11, 2017 at 4:44 PM Dmitriy Vsekhvalnov <
> >> dvsekhvalnov@gmail.com>
> >> wrote:
> >>
> >>> Hey, want to resurrect this thread.
> >>>
> >>> Decided to do idle test, where no load data is produced to topic at
> all.
> >>> And when we kill #101 or #102 - nothing happening. But when we kill
> #200
> >> -
> >>> consumers starts to re-consume old events from random position.
> >>>
> >>> Anybody have ideas what to check?  I really expected that Kafka will
> fail
> >>> symmetrical with respect to any broker.
> >>>
> >>> On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <
> >>> dvsekhvalnov@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi tao,
> >>>>
> >>>> we had unclean leader election enabled at the beginning. But then
> >>> disabled
> >>>> it and also reduced 'max.poll.records' value.  It helped little bit.
> >>>>
> >>>> But after today's testing there is strong correlation between lag
> spike
> >>>> and what broker we crash. For lowest ID (100) broker :
> >>>>  1. it always at least 1-2 orders higher lag
> >>>>  2. we start getting
> >>>>
> >>>> org.apache.kafka.clients.consumer.CommitFailedException: Commit
> >> cannot be
> >>>> completed since the group has already rebalanced and assigned the
> >>>> partitions to another member. This means that the time between
> >> subsequent
> >>>> calls to poll() was longer than the configured max.poll.interval.ms,
> >>>> which typically implies that the poll loop is spending too much time
> >>>> message processing. You can address this either by increasing the
> >> session
> >>>> timeout or by reducing the maximum size of batches returned in poll()
> >>> with
> >>>> max.poll.records.
> >>>>
> >>>>  3. sometime re-consumption from random position
> >>>>
> >>>> And when we crashing other brokers (101, 102), it just lag spike of
> >> ~10Ks
> >>>> order, settle down quite quickly, no consumer exceptions.
> >>>>
> >>>> Totally lost what to try next.
> >>>>
> >>>>> On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xi...@gmail.com>
> wrote:
> >>>>>
> >>>>> Do you have unclean leader election turned on? If killing 100 is the
> >>> only
> >>>>> way to reproduce the problem, it is possible with unclean leader
> >>> election
> >>>>> turned on that leadership was transferred to out of ISR follower
> which
> >>> may
> >>>>> not have the latest high watermark
> >>>>> On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov <
> >>>>> dvsekhvalnov@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> About to verify hypothesis on monday, but looks like that in latest
> >>>>> tests.
> >>>>>> Need to double check.
> >>>>>>
> >>>>>> On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> So no matter in what sequence you shutdown brokers it is only 1
> >> that
> >>>>>> causes
> >>>>>>> the major problem? That would indeed be a bit weird. have you
> >>> checked
> >>>>>>> offsets of your consumer - right after offsets jump back - does it
> >>>>> start
> >>>>>>> from the topic start or does it go back to some random position?
> >>> Have
> >>>>> you
> >>>>>>> checked if all offsets are actually being committed by consumers?
> >>>>>>>
> >>>>>>> fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
> >>>>>>> dvsekhvalnov@gmail.com
> >>>>>>>> :
> >>>>>>>
> >>>>>>>> Yeah, probably we can dig around.
> >>>>>>>>
> >>>>>>>> One more observation, the most lag/re-consumption trouble
> >>> happening
> >>>>>> when
> >>>>>>> we
> >>>>>>>> kill broker with lowest id (e.g. 100 from [100,101,102]).
> >>>>>>>> When crashing other brokers - there is nothing special
> >> happening,
> >>>>> lag
> >>>>>>>> growing little bit but nothing crazy (e.g. thousands, not
> >>> millions).
> >>>>>>>>
> >>>>>>>> Is it sounds suspicious?
> >>>>>>>>
> >>>>>>>> On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <
> >> schizhov@gmail.com>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Ted: when choosing earliest/latest you are saying: if it
> >> happens
> >>>>> that
> >>>>>>>> there
> >>>>>>>>> is no "valid" offset committed for a consumer (for whatever
> >>>>> reason:
> >>>>>>>>> bug/misconfiguration/no luck) it will be ok to start from the
> >>>>>> beginning
> >>>>>>>> or
> >>>>>>>>> end of the topic. So if you are not ok with that you should
> >>> choose
> >>>>>>> none.
> >>>>>>>>>
> >>>>>>>>> Dmitriy: Ok. Then it is spring-kafka that maintains this
> >> offset
> >>>>> per
> >>>>>>>>> partition state for you. it might also has that problem of
> >>> leaving
> >>>>>>> stale
> >>>>>>>>> offsets lying around, After quickly looking through
> >>>>>>>>> https://github.com/spring-projects/spring-kafka/blob/
> >>>>>>>>> 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> >>>>>>>>> main/java/org/springframework/kafka/listener/
> >>>>>>>>> KafkaMessageListenerContainer.java
> >>>>>>>>> it looks possible since offsets map is not cleared upon
> >>> partition
> >>>>>>>>> revocation, but that is just a hypothesis. I have no
> >> experience
> >>>>> with
> >>>>>>>>> spring-kafka. However since you say you consumers were always
> >>>>> active
> >>>>>> I
> >>>>>>>> find
> >>>>>>>>> this theory worth investigating.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> >>>>>>>>> vincent.dautremont@olamobile.com.invalid>:
> >>>>>>>>>
> >>>>>>>>>> is there a way to read messages on a topic partition from a
> >>>>>> specific
> >>>>>>>> node
> >>>>>>>>>> we that we choose (and not by the topic partition leader) ?
> >>>>>>>>>> I would like to read myself that each of the
> >>> __consumer_offsets
> >>>>>>>> partition
> >>>>>>>>>> replicas have the same consumer group offset written in it
> >> in
> >>>>> it.
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> >>>>>>>>>> dvsekhvalnov@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Stas:
> >>>>>>>>>>>
> >>>>>>>>>>> we rely on spring-kafka, it  commits offsets "manually"
> >> for
> >>> us
> >>>>>>> after
> >>>>>>>>>> event
> >>>>>>>>>>> handler completed. So it's kind of automatic once there is
> >>>>>> constant
> >>>>>>>>>> stream
> >>>>>>>>>>> of events (no idle time, which is true for us). Though
> >> it's
> >>>>> not
> >>>>>>> what
> >>>>>>>>> pure
> >>>>>>>>>>> kafka-client calls "automatic" (flush commits at fixed
> >>>>>> intervals).
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <
> >>>>> schizhov@gmail.com
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> You don't have autocmmit enables that means you commit
> >>>>> offsets
> >>>>>>>>>> yourself -
> >>>>>>>>>>>> correct? If you store them per partition somewhere and
> >>> fail
> >>>>> to
> >>>>>>>> clean
> >>>>>>>>> it
> >>>>>>>>>>> up
> >>>>>>>>>>>> upon rebalance next time the consumer gets this
> >> partition
> >>>>>>> assigned
> >>>>>>>>>> during
> >>>>>>>>>>>> next rebalance it can commit old stale offset- can this
> >> be
> >>>>> the
> >>>>>>>> case?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> >>>>>>>>>>>> dvsekhvalnov@gmail.com
> >>>>>>>>>>>>> :
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Reprocessing same events again - is fine for us
> >>>>> (idempotent).
> >>>>>>>> While
> >>>>>>>>>>>> loosing
> >>>>>>>>>>>>> data is more critical.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What are reasons of such behaviour? Consumers are
> >> never
> >>>>> idle,
> >>>>>>>>> always
> >>>>>>>>>>>>> commiting, probably something wrong with broker setup
> >>>>> then?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <
> >>>>> yuzhihong@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Stas:
> >>>>>>>>>>>>>> bq.  using anything but none is not really an option
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If you have time, can you explain a bit more ?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
> >>>>>>>> schizhov@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If you set auto.offset.reset to none next time it
> >>>>> happens
> >>>>>>> you
> >>>>>>>>>> will
> >>>>>>>>>>> be
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>> much better position to find out what happens.
> >> Also
> >>> in
> >>>>>>>> general
> >>>>>>>>>> with
> >>>>>>>>>>>>>> current
> >>>>>>>>>>>>>>> semantics of offset reset policy IMO using
> >> anything
> >>>>> but
> >>>>>>> none
> >>>>>>>> is
> >>>>>>>>>> not
> >>>>>>>>>>>>>> really
> >>>>>>>>>>>>>>> an option unless it is ok for consumer to loose
> >> some
> >>>>> data
> >>>>>>>>>> (latest)
> >>>>>>>>>>> or
> >>>>>>>>>>>>>>> reprocess it second time (earliest).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
> >>>>>>> yuzhihong@gmail.com
> >>>>>>>>> :
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Should Kafka log warning if log.retention.hours
> >> is
> >>>>>> lower
> >>>>>>>> than
> >>>>>>>>>>>> number
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>> hours specified by offsets.retention.minutes ?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> >>>>>>>>>>>> manikumar.reddy@gmail.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> normally, log.retention.hours (168hrs)  should
> >>> be
> >>>>>>> higher
> >>>>>>>>> than
> >>>>>>>>>>>>>>>>> offsets.retention.minutes (336 hrs)?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy
> >>>>> Vsekhvalnov <
> >>>>>>>>>>>>>>>>> dvsekhvalnov@gmail.com>
> >>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Ted,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Broker: v0.11.0.0
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Consumer:
> >>>>>>>>>>>>>>>>>> kafka-clients v0.11.0.0
> >>>>>>>>>>>>>>>>>> auto.offset.reset = earliest
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> >>>>>>>>>> yuzhihong@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> What's the value for auto.offset.reset  ?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Which release are you using ?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Cheers
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
> >>>>>>> Vsekhvalnov <
> >>>>>>>>>>>>>>>>>>> dvsekhvalnov@gmail.com>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> we several time faced situation where
> >>>>>>>> consumer-group
> >>>>>>>>>>>> started
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> re-consume
> >>>>>>>>>>>>>>>>>>>> old events from beginning. Here is
> >>> scenario:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1. x3 broker kafka cluster on top of x3
> >>> node
> >>>>>>>>> zookeeper
> >>>>>>>>>>>>>>>>>>>> 2. RF=3 for all topics
> >>>>>>>>>>>>>>>>>>>> 3. log.retention.hours=168 and
> >>>>>>>>>>>>> offsets.retention.minutes=20160
> >>>>>>>>>>>>>>>>>>>> 4. running sustainable load (pushing
> >>> events)
> >>>>>>>>>>>>>>>>>>>> 5. doing disaster testing by randomly
> >>>>> shutting
> >>>>>>>> down 1
> >>>>>>>>>> of
> >>>>>>>>>>> 3
> >>>>>>>>>>>>>> broker
> >>>>>>>>>>>>>>>>> nodes
> >>>>>>>>>>>>>>>>>>>> (then provision new broker back)
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Several times after bouncing broker we
> >>> faced
> >>>>>>>>> situation
> >>>>>>>>>>>> where
> >>>>>>>>>>>>>>>> consumer
> >>>>>>>>>>>>>>>>>>> group
> >>>>>>>>>>>>>>>>>>>> started to re-consume old events.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> consumer group:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1. enable.auto.commit = false
> >>>>>>>>>>>>>>>>>>>> 2. tried graceful group shutdown, kill
> >> -9
> >>>>> and
> >>>>>>>>>> terminating
> >>>>>>>>>>>> AWS
> >>>>>>>>>>>>>>> nodes
> >>>>>>>>>>>>>>>>>>>> 3. never experienced re-consumption for
> >>>>> given
> >>>>>>>> cases.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> What can cause that old events
> >>>>> re-consumption?
> >>>>>> Is
> >>>>>>>> it
> >>>>>>>>>>>> related
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>> bouncing
> >>>>>>>>>>>>>>>>>>>> one of brokers? What to search in a
> >> logs?
> >>>>> Any
> >>>>>>>> broker
> >>>>>>>>>>>> settings
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> try?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks in advance.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
>
>

Re: kafka broker losing offsets?

Posted by Vincent Dautremont <vi...@olamobile.com.INVALID>.
Hi,
We have 4 different Kafka clusters running:
2 on 0.10.1.0
1 on 0.10.0.1
1 that was on 0.11.0.0 and was updated to 0.11.0.1 last week

I've only seen the issue happen twice in production usage, on the 0.11.0.0 cluster, in the roughly 3 months it has been running.

But I'll monitor and report here if it ever happens again. We'll also upgrade all our clusters to 0.11.0.1 in the coming days.

🤞🏻!

> Le 11 oct. 2017 à 17:47, Dmitriy Vsekhvalnov <dv...@gmail.com> a écrit :
> 
> Yeah just pops up in my list. Thanks, i'll take a look.
> 
> Vincent Dautremont, if you still reading it, did you try upgrade to
> 0.11.0.1? Fixed issue?
> 
> On Wed, Oct 11, 2017 at 6:46 PM, Ben Davison <be...@7digital.com>
> wrote:
> 
>> Hi Dmitriy,
>> 
>> Did you check out this thread "Incorrect consumer offsets after broker
>> restart 0.11.0.0" from Phil Luckhurst, it sounds similar.
>> 
>> Thanks,
>> 
>> Ben
>> 
>> On Wed, Oct 11, 2017 at 4:44 PM Dmitriy Vsekhvalnov <
>> dvsekhvalnov@gmail.com>
>> wrote:
>> 
>>> Hey, want to resurrect this thread.
>>> 
>>> Decided to do idle test, where no load data is produced to topic at all.
>>> And when we kill #101 or #102 - nothing happening. But when we kill #200
>> -
>>> consumers starts to re-consume old events from random position.
>>> 
>>> Anybody have ideas what to check?  I really expected that Kafka will fail
>>> symmetrical with respect to any broker.
>>> 
>>> On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <
>>> dvsekhvalnov@gmail.com>
>>> wrote:
>>> 
>>>> Hi tao,
>>>> 
>>>> we had unclean leader election enabled at the beginning. But then
>>> disabled
>>>> it and also reduced 'max.poll.records' value.  It helped little bit.
>>>> 
>>>> But after today's testing there is strong correlation between lag spike
>>>> and what broker we crash. For lowest ID (100) broker :
>>>>  1. it always at least 1-2 orders higher lag
>>>>  2. we start getting
>>>> 
>>>> org.apache.kafka.clients.consumer.CommitFailedException: Commit
>> cannot be
>>>> completed since the group has already rebalanced and assigned the
>>>> partitions to another member. This means that the time between
>> subsequent
>>>> calls to poll() was longer than the configured max.poll.interval.ms,
>>>> which typically implies that the poll loop is spending too much time
>>>> message processing. You can address this either by increasing the
>> session
>>>> timeout or by reducing the maximum size of batches returned in poll()
>>> with
>>>> max.poll.records.
>>>> 
>>>>  3. sometime re-consumption from random position
>>>> 
>>>> And when we crashing other brokers (101, 102), it just lag spike of
>> ~10Ks
>>>> order, settle down quite quickly, no consumer exceptions.
>>>> 
>>>> Totally lost what to try next.
>>>> 
>>>>> On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xi...@gmail.com> wrote:
>>>>> 
>>>>> Do you have unclean leader election turned on? If killing 100 is the
>>> only
>>>>> way to reproduce the problem, it is possible with unclean leader
>>> election
>>>>> turned on that leadership was transferred to out of ISR follower which
>>> may
>>>>> not have the latest high watermark
>>>>> On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov <
>>>>> dvsekhvalnov@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> About to verify hypothesis on monday, but looks like that in latest
>>>>> tests.
>>>>>> Need to double check.
>>>>>> 
>>>>>> On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> So no matter in what sequence you shutdown brokers it is only 1
>> that
>>>>>> causes
>>>>>>> the major problem? That would indeed be a bit weird. have you
>>> checked
>>>>>>> offsets of your consumer - right after offsets jump back - does it
>>>>> start
>>>>>>> from the topic start or does it go back to some random position?
>>> Have
>>>>> you
>>>>>>> checked if all offsets are actually being committed by consumers?
>>>>>>> 
>>>>>>> fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
>>>>>>> dvsekhvalnov@gmail.com
>>>>>>>> :
>>>>>>> 
>>>>>>>> Yeah, probably we can dig around.
>>>>>>>> 
>>>>>>>> One more observation, the most lag/re-consumption trouble
>>> happening
>>>>>> when
>>>>>>> we
>>>>>>>> kill broker with lowest id (e.g. 100 from [100,101,102]).
>>>>>>>> When crashing other brokers - there is nothing special
>> happening,
>>>>> lag
>>>>>>>> growing little bit but nothing crazy (e.g. thousands, not
>>> millions).
>>>>>>>> 
>>>>>>>> Is it sounds suspicious?
>>>>>>>> 
>>>>>>>> On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <
>> schizhov@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Ted: when choosing earliest/latest you are saying: if it
>> happens
>>>>> that
>>>>>>>> there
>>>>>>>>> is no "valid" offset committed for a consumer (for whatever
>>>>> reason:
>>>>>>>>> bug/misconfiguration/no luck) it will be ok to start from the
>>>>>> beginning
>>>>>>>> or
>>>>>>>>> end of the topic. So if you are not ok with that you should
>>> choose
>>>>>>> none.
>>>>>>>>> 
>>>>>>>>> Dmitriy: Ok. Then it is spring-kafka that maintains this
>> offset
>>>>> per
>>>>>>>>> partition state for you. it might also has that problem of
>>> leaving
>>>>>>> stale
>>>>>>>>> offsets lying around, After quickly looking through
>>>>>>>>> https://github.com/spring-projects/spring-kafka/blob/
>>>>>>>>> 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
>>>>>>>>> main/java/org/springframework/kafka/listener/
>>>>>>>>> KafkaMessageListenerContainer.java
>>>>>>>>> it looks possible since offsets map is not cleared upon
>>> partition
>>>>>>>>> revocation, but that is just a hypothesis. I have no
>> experience
>>>>> with
>>>>>>>>> spring-kafka. However since you say you consumers were always
>>>>> active
>>>>>> I
>>>>>>>> find
>>>>>>>>> this theory worth investigating.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
>>>>>>>>> vincent.dautremont@olamobile.com.invalid>:
>>>>>>>>> 
>>>>>>>>>> is there a way to read messages on a topic partition from a
>>>>>> specific
>>>>>>>> node
>>>>>>>>>> we that we choose (and not by the topic partition leader) ?
>>>>>>>>>> I would like to read myself that each of the
>>> __consumer_offsets
>>>>>>>> partition
>>>>>>>>>> replicas have the same consumer group offset written in it
>> in
>>>>> it.
>>>>>>>>>> 
>>>>>>>>>> On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
>>>>>>>>>> dvsekhvalnov@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Stas:
>>>>>>>>>>> 
>>>>>>>>>>> we rely on spring-kafka, it  commits offsets "manually"
>> for
>>> us
>>>>>>> after
>>>>>>>>>> event
>>>>>>>>>>> handler completed. So it's kind of automatic once there is
>>>>>> constant
>>>>>>>>>> stream
>>>>>>>>>>> of events (no idle time, which is true for us). Though
>> it's
>>>>> not
>>>>>>> what
>>>>>>>>> pure
>>>>>>>>>>> kafka-client calls "automatic" (flush commits at fixed
>>>>>> intervals).
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <
>>>>> schizhov@gmail.com
>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> You don't have autocmmit enables that means you commit
>>>>> offsets
>>>>>>>>>> yourself -
>>>>>>>>>>>> correct? If you store them per partition somewhere and
>>> fail
>>>>> to
>>>>>>>> clean
>>>>>>>>> it
>>>>>>>>>>> up
>>>>>>>>>>>> upon rebalance next time the consumer gets this
>> partition
>>>>>>> assigned
>>>>>>>>>> during
>>>>>>>>>>>> next rebalance it can commit old stale offset- can this
>> be
>>>>> the
>>>>>>>> case?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
>>>>>>>>>>>> dvsekhvalnov@gmail.com
>>>>>>>>>>>>> :
>>>>>>>>>>>> 
>>>>>>>>>>>>> Reprocessing same events again - is fine for us
>>>>> (idempotent).
>>>>>>>> While
>>>>>>>>>>>> loosing
>>>>>>>>>>>>> data is more critical.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What are reasons of such behaviour? Consumers are
>> never
>>>>> idle,
>>>>>>>>> always
>>>>>>>>>>>>> commiting, probably something wrong with broker setup
>>>>> then?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <
>>>>> yuzhihong@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Stas:
>>>>>>>>>>>>>> bq.  using anything but none is not really an option
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If you have time, can you explain a bit more ?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
>>>>>>>> schizhov@gmail.com
>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If you set auto.offset.reset to none next time it
>>>>> happens
>>>>>>> you
>>>>>>>>>> will
>>>>>>>>>>> be
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>> much better position to find out what happens.
>> Also
>>> in
>>>>>>>> general
>>>>>>>>>> with
>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>> semantics of offset reset policy IMO using
>> anything
>>>>> but
>>>>>>> none
>>>>>>>> is
>>>>>>>>>> not
>>>>>>>>>>>>>> really
>>>>>>>>>>>>>>> an option unless it is ok for consumer to loose
>> some
>>>>> data
>>>>>>>>>> (latest)
>>>>>>>>>>> or
>>>>>>>>>>>>>>> reprocess it second time (earliest).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
>>>>>>> yuzhihong@gmail.com
>>>>>>>>> :
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Should Kafka log warning if log.retention.hours
>> is
>>>>>> lower
>>>>>>>> than
>>>>>>>>>>>> number
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>> hours specified by offsets.retention.minutes ?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
>>>>>>>>>>>> manikumar.reddy@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> normally, log.retention.hours (168hrs)  should
>>> be
>>>>>>> higher
>>>>>>>>> than
>>>>>>>>>>>>>>>>> offsets.retention.minutes (336 hrs)?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy
>>>>> Vsekhvalnov <
>>>>>>>>>>>>>>>>> dvsekhvalnov@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Ted,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Broker: v0.11.0.0
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Consumer:
>>>>>>>>>>>>>>>>>> kafka-clients v0.11.0.0
>>>>>>>>>>>>>>>>>> auto.offset.reset = earliest
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
>>>>>>>>>> yuzhihong@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> What's the value for auto.offset.reset  ?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Which release are you using ?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
>>>>>>> Vsekhvalnov <
>>>>>>>>>>>>>>>>>>> dvsekhvalnov@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> we several time faced situation where
>>>>>>>> consumer-group
>>>>>>>>>>>> started
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> re-consume
>>>>>>>>>>>>>>>>>>>> old events from beginning. Here is
>>> scenario:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1. x3 broker kafka cluster on top of x3
>>> node
>>>>>>>>> zookeeper
>>>>>>>>>>>>>>>>>>>> 2. RF=3 for all topics
>>>>>>>>>>>>>>>>>>>> 3. log.retention.hours=168 and
>>>>>>>>>>>>> offsets.retention.minutes=20160
>>>>>>>>>>>>>>>>>>>> 4. running sustainable load (pushing
>>> events)
>>>>>>>>>>>>>>>>>>>> 5. doing disaster testing by randomly
>>>>> shutting
>>>>>>>> down 1
>>>>>>>>>> of
>>>>>>>>>>> 3
>>>>>>>>>>>>>> broker
>>>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>>>> (then provision new broker back)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Several times after bouncing broker we
>>> faced
>>>>>>>>> situation
>>>>>>>>>>>> where
>>>>>>>>>>>>>>>> consumer
>>>>>>>>>>>>>>>>>>> group
>>>>>>>>>>>>>>>>>>>> started to re-consume old events.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> consumer group:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1. enable.auto.commit = false
>>>>>>>>>>>>>>>>>>>> 2. tried graceful group shutdown, kill
>> -9
>>>>> and
>>>>>>>>>> terminating
>>>>>>>>>>>> AWS
>>>>>>>>>>>>>>> nodes
>>>>>>>>>>>>>>>>>>>> 3. never experienced re-consumption for
>>>>> given
>>>>>>>> cases.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> What can cause that old events
>>>>> re-consumption?
>>>>>> Is
>>>>>>>> it
>>>>>>>>>>>> related
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> bouncing
>>>>>>>>>>>>>>>>>>>> one of brokers? What to search in a
>> logs?
>>>>> Any
>>>>>>>> broker
>>>>>>>>>>>> settings
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> try?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> The information transmitted is intended only for the person
>> or
>>>>>> entity
>>>>>>>> to
>>>>>>>>>> which it is addressed and may contain confidential and/or
>>>>>> privileged
>>>>>>>>>> material. Any review, retransmission, dissemination or other
>>> use
>>>>>> of,
>>>>>>> or
>>>>>>>>>> taking of any action in reliance upon, this information by
>>>>> persons
>>>>>> or
>>>>>>>>>> entities other than the intended recipient is prohibited. If
>>> you
>>>>>>>> received
>>>>>>>>>> this in error, please contact the sender and delete the
>>> material
>>>>>> from
>>>>>>>> any
>>>>>>>>>> computer.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> --
>> 
>> 
>> This email, including attachments, is private and confidential. If you have
>> received this email in error please notify the sender and delete it from
>> your system. Emails are not secure and may contain viruses. No liability
>> can be accepted for viruses that might be transferred by this email or any
>> attachment. Any unauthorised copying of this message or unauthorised
>> distribution and publication of the information contained herein are
>> prohibited.
>> 
>> 7digital Group plc. Registered office: 69 Wilson Street, London EC2A 2BB.
>> Registered in England and Wales. Registered No. 04843573.
>> 


Re: kafka broker losing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
Yeah, it just popped up in my list. Thanks, I'll take a look.

Vincent Dautremont, if you're still reading this, did you try upgrading to
0.11.0.1? Did it fix the issue?

On Wed, Oct 11, 2017 at 6:46 PM, Ben Davison <be...@7digital.com>
wrote:

> Hi Dmitriy,
>
> Did you check out this thread "Incorrect consumer offsets after broker
> restart 0.11.0.0" from Phil Luckhurst, it sounds similar.
>
> Thanks,
>
> Ben
>
> On Wed, Oct 11, 2017 at 4:44 PM Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com>
> wrote:
>
> > Hey, want to resurrect this thread.
> >
> > Decided to do idle test, where no load data is produced to topic at all.
> > And when we kill #101 or #102 - nothing happening. But when we kill #200
> -
> > consumers starts to re-consume old events from random position.
> >
> > Anybody have ideas what to check?  I really expected that Kafka will fail
> > symmetrical with respect to any broker.
> >
> > On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <
> > dvsekhvalnov@gmail.com>
> > wrote:
> >
> > > Hi tao,
> > >
> > > we had unclean leader election enabled at the beginning. But then
> > disabled
> > > it and also reduced 'max.poll.records' value.  It helped little bit.
> > >
> > > But after today's testing there is strong correlation between lag spike
> > > and what broker we crash. For lowest ID (100) broker :
> > >   1. it always at least 1-2 orders higher lag
> > >   2. we start getting
> > >
> > > org.apache.kafka.clients.consumer.CommitFailedException: Commit
> cannot be
> > > completed since the group has already rebalanced and assigned the
> > > partitions to another member. This means that the time between
> subsequent
> > > calls to poll() was longer than the configured max.poll.interval.ms,
> > > which typically implies that the poll loop is spending too much time
> > > message processing. You can address this either by increasing the
> session
> > > timeout or by reducing the maximum size of batches returned in poll()
> > with
> > > max.poll.records.
> > >
> > >   3. sometime re-consumption from random position
> > >
> > > And when we crashing other brokers (101, 102), it just lag spike of
> ~10Ks
> > > order, settle down quite quickly, no consumer exceptions.
> > >
> > > Totally lost what to try next.
> > >
> > > On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xi...@gmail.com> wrote:
> > >
> > >> Do you have unclean leader election turned on? If killing 100 is the
> > only
> > >> way to reproduce the problem, it is possible with unclean leader
> > election
> > >> turned on that leadership was transferred to out of ISR follower which
> > may
> > >> not have the latest high watermark
> > >> On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov <
> > >> dvsekhvalnov@gmail.com>
> > >> wrote:
> > >>
> > >> > About to verify hypothesis on monday, but looks like that in latest
> > >> tests.
> > >> > Need to double check.
> > >> >
> > >> > On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com>
> > >> wrote:
> > >> >
> > >> > > So no matter in what sequence you shutdown brokers it is only 1
> that
> > >> > causes
> > >> > > the major problem? That would indeed be a bit weird. have you
> > checked
> > >> > > offsets of your consumer - right after offsets jump back - does it
> > >> start
> > >> > > from the topic start or does it go back to some random position?
> > Have
> > >> you
> > >> > > checked if all offsets are actually being committed by consumers?
> > >> > >
> > >> > > fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
> > >> > > dvsekhvalnov@gmail.com
> > >> > > >:
> > >> > >
> > >> > > > Yeah, probably we can dig around.
> > >> > > >
> > >> > > > One more observation, the most lag/re-consumption trouble
> > happening
> > >> > when
> > >> > > we
> > >> > > > kill broker with lowest id (e.g. 100 from [100,101,102]).
> > >> > > > When crashing other brokers - there is nothing special
> happening,
> > >> lag
> > >> > > > growing little bit but nothing crazy (e.g. thousands, not
> > millions).
> > >> > > >
> > >> > > > Is it sounds suspicious?
> > >> > > >
> > >> > > > On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <
> schizhov@gmail.com>
> > >> > wrote:
> > >> > > >
> > >> > > > > Ted: when choosing earliest/latest you are saying: if it
> happens
> > >> that
> > >> > > > there
> > >> > > > > is no "valid" offset committed for a consumer (for whatever
> > >> reason:
> > >> > > > > bug/misconfiguration/no luck) it will be ok to start from the
> > >> > beginning
> > >> > > > or
> > >> > > > > end of the topic. So if you are not ok with that you should
> > choose
> > >> > > none.
> > >> > > > >
> > >> > > > > Dmitriy: Ok. Then it is spring-kafka that maintains this
> offset
> > >> per
> > >> > > > > partition state for you. it might also has that problem of
> > leaving
> > >> > > stale
> > >> > > > > offsets lying around, After quickly looking through
> > >> > > > > https://github.com/spring-projects/spring-kafka/blob/
> > >> > > > > 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> > >> > > > > main/java/org/springframework/kafka/listener/
> > >> > > > > KafkaMessageListenerContainer.java
> > >> > > > > it looks possible since offsets map is not cleared upon
> > partition
> > >> > > > > revocation, but that is just a hypothesis. I have no
> experience
> > >> with
> > >> > > > > spring-kafka. However since you say you consumers were always
> > >> active
> > >> > I
> > >> > > > find
> > >> > > > > this theory worth investigating.
> > >> > > > >
> > >> > > > >
> > >> > > > > 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> > >> > > > > vincent.dautremont@olamobile.com.invalid>:
> > >> > > > >
> > >> > > > > > is there a way to read messages on a topic partition from a
> > >> > specific
> > >> > > > node
> > >> > > > > > we that we choose (and not by the topic partition leader) ?
> > >> > > > > > I would like to read myself that each of the
> > __consumer_offsets
> > >> > > > partition
> > >> > > > > > replicas have the same consumer group offset written in it
> in
> > >> it.
> > >> > > > > >
> > >> > > > > > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> > >> > > > > > dvsekhvalnov@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Stas:
> > >> > > > > > >
> > >> > > > > > > we rely on spring-kafka, it  commits offsets "manually"
> for
> > us
> > >> > > after
> > >> > > > > > event
> > >> > > > > > > handler completed. So it's kind of automatic once there is
> > >> > constant
> > >> > > > > > stream
> > >> > > > > > > of events (no idle time, which is true for us). Though
> it's
> > >> not
> > >> > > what
> > >> > > > > pure
> > >> > > > > > > kafka-client calls "automatic" (flush commits at fixed
> > >> > intervals).
> > >> > > > > > >
> > >> > > > > > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <
> > >> schizhov@gmail.com
> > >> > >
> > >> > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > You don't have autocmmit enables that means you commit
> > >> offsets
> > >> > > > > > yourself -
> > >> > > > > > > > correct? If you store them per partition somewhere and
> > fail
> > >> to
> > >> > > > clean
> > >> > > > > it
> > >> > > > > > > up
> > >> > > > > > > > upon rebalance next time the consumer gets this
> partition
> > >> > > assigned
> > >> > > > > > during
> > >> > > > > > > > next rebalance it can commit old stale offset- can this
> be
> > >> the
> > >> > > > case?
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > >> > > > > > > > dvsekhvalnov@gmail.com
> > >> > > > > > > > >:
> > >> > > > > > > >
> > >> > > > > > > > > Reprocessing same events again - is fine for us
> > >> (idempotent).
> > >> > > > While
> > >> > > > > > > > loosing
> > >> > > > > > > > > data is more critical.
> > >> > > > > > > > >
> > >> > > > > > > > > What are reasons of such behaviour? Consumers are
> never
> > >> idle,
> > >> > > > > always
> > >> > > > > > > > > commiting, probably something wrong with broker setup
> > >> then?
> > >> > > > > > > > >
> > >> > > > > > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <
> > >> yuzhihong@gmail.com>
> > >> > > > > wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Stas:
> > >> > > > > > > > > > bq.  using anything but none is not really an option
> > >> > > > > > > > > >
> > >> > > > > > > > > > If you have time, can you explain a bit more ?
> > >> > > > > > > > > >
> > >> > > > > > > > > > Thanks
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
> > >> > > > schizhov@gmail.com
> > >> > > > > >
> > >> > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > If you set auto.offset.reset to none next time it
> > >> happens
> > >> > > you
> > >> > > > > > will
> > >> > > > > > > be
> > >> > > > > > > > > in
> > >> > > > > > > > > > > much better position to find out what happens.
> Also
> > in
> > >> > > > general
> > >> > > > > > with
> > >> > > > > > > > > > current
> > >> > > > > > > > > > > semantics of offset reset policy IMO using
> anything
> > >> but
> > >> > > none
> > >> > > > is
> > >> > > > > > not
> > >> > > > > > > > > > really
> > >> > > > > > > > > > > an option unless it is ok for consumer to loose
> some
> > >> data
> > >> > > > > > (latest)
> > >> > > > > > > or
> > >> > > > > > > > > > > reprocess it second time (earliest).
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
> > >> > > yuzhihong@gmail.com
> > >> > > > >:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > Should Kafka log warning if log.retention.hours
> is
> > >> > lower
> > >> > > > than
> > >> > > > > > > > number
> > >> > > > > > > > > of
> > >> > > > > > > > > > > > hours specified by offsets.retention.minutes ?
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > >> > > > > > > > manikumar.reddy@gmail.com
> > >> > > > > > > > > >
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > normally, log.retention.hours (168hrs)  should
> > be
> > >> > > higher
> > >> > > > > than
> > >> > > > > > > > > > > > > offsets.retention.minutes (336 hrs)?
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy
> > >> Vsekhvalnov <
> > >> > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> > >> > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Hi Ted,
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Broker: v0.11.0.0
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Consumer:
> > >> > > > > > > > > > > > > > kafka-clients v0.11.0.0
> > >> > > > > > > > > > > > > > auto.offset.reset = earliest
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> > >> > > > > > yuzhihong@gmail.com>
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > What's the value for auto.offset.reset  ?
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Which release are you using ?
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Cheers
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
> > >> > > Vsekhvalnov <
> > >> > > > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> > >> > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Hi all,
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > we several time faced situation where
> > >> > > > consumer-group
> > >> > > > > > > > started
> > >> > > > > > > > > to
> > >> > > > > > > > > > > > > > > re-consume
> > >> > > > > > > > > > > > > > > > old events from beginning. Here is
> > scenario:
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3
> > node
> > >> > > > > zookeeper
> > >> > > > > > > > > > > > > > > > 2. RF=3 for all topics
> > >> > > > > > > > > > > > > > > > 3. log.retention.hours=168 and
> > >> > > > > > > > > offsets.retention.minutes=20160
> > >> > > > > > > > > > > > > > > > 4. running sustainable load (pushing
> > events)
> > >> > > > > > > > > > > > > > > > 5. doing disaster testing by randomly
> > >> shutting
> > >> > > > down 1
> > >> > > > > > of
> > >> > > > > > > 3
> > >> > > > > > > > > > broker
> > >> > > > > > > > > > > > > nodes
> > >> > > > > > > > > > > > > > > > (then provision new broker back)
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Several times after bouncing broker we
> > faced
> > >> > > > > situation
> > >> > > > > > > > where
> > >> > > > > > > > > > > > consumer
> > >> > > > > > > > > > > > > > > group
> > >> > > > > > > > > > > > > > > > started to re-consume old events.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > consumer group:
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > 1. enable.auto.commit = false
> > >> > > > > > > > > > > > > > > > 2. tried graceful group shutdown, kill
> -9
> > >> and
> > >> > > > > > terminating
> > >> > > > > > > > AWS
> > >> > > > > > > > > > > nodes
> > >> > > > > > > > > > > > > > > > 3. never experienced re-consumption for
> > >> given
> > >> > > > cases.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > What can cause that old events
> > >> re-consumption?
> > >> > Is
> > >> > > > it
> > >> > > > > > > > related
> > >> > > > > > > > > to
> > >> > > > > > > > > > > > > > bouncing
> > >> > > > > > > > > > > > > > > > one of brokers? What to search in a
> logs?
> > >> Any
> > >> > > > broker
> > >> > > > > > > > settings
> > >> > > > > > > > > > to
> > >> > > > > > > > > > > > try?
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Thanks in advance.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: kafka broker loosing offsets?

Posted by Ben Davison <be...@7digital.com>.
Hi Dmitriy,

Did you check out the thread "Incorrect consumer offsets after broker
restart 0.11.0.0" from Phil Luckhurst? It sounds similar.

Thanks,

Ben

On Wed, Oct 11, 2017 at 4:44 PM Dmitriy Vsekhvalnov <dv...@gmail.com>
wrote:

> Hey, want to resurrect this thread.
>
> Decided to do idle test, where no load data is produced to topic at all.
> And when we kill #101 or #102 - nothing happening. But when we kill #200 -
> consumers starts to re-consume old events from random position.
>
> Anybody have ideas what to check?  I really expected that Kafka will fail
> symmetrical with respect to any broker.
>
> On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com>
> wrote:
>
> > Hi tao,
> >
> > we had unclean leader election enabled at the beginning. But then
> disabled
> > it and also reduced 'max.poll.records' value.  It helped little bit.
> >
> > But after today's testing there is strong correlation between lag spike
> > and what broker we crash. For lowest ID (100) broker :
> >   1. it always at least 1-2 orders higher lag
> >   2. we start getting
> >
> > org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be
> > completed since the group has already rebalanced and assigned the
> > partitions to another member. This means that the time between subsequent
> > calls to poll() was longer than the configured max.poll.interval.ms,
> > which typically implies that the poll loop is spending too much time
> > message processing. You can address this either by increasing the session
> > timeout or by reducing the maximum size of batches returned in poll()
> with
> > max.poll.records.
> >
> >   3. sometime re-consumption from random position
> >
> > And when we crashing other brokers (101, 102), it just lag spike of ~10Ks
> > order, settle down quite quickly, no consumer exceptions.
> >
> > Totally lost what to try next.
> >
> > On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xi...@gmail.com> wrote:
> >
> >> Do you have unclean leader election turned on? If killing 100 is the
> only
> >> way to reproduce the problem, it is possible with unclean leader
> election
> >> turned on that leadership was transferred to out of ISR follower which
> may
> >> not have the latest high watermark
> >> On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov <
> >> dvsekhvalnov@gmail.com>
> >> wrote:
> >>
> >> > About to verify hypothesis on monday, but looks like that in latest
> >> tests.
> >> > Need to double check.
> >> >
> >> > On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com>
> >> wrote:
> >> >
> >> > > So no matter in what sequence you shutdown brokers it is only 1 that
> >> > causes
> >> > > the major problem? That would indeed be a bit weird. have you
> checked
> >> > > offsets of your consumer - right after offsets jump back - does it
> >> start
> >> > > from the topic start or does it go back to some random position?
> Have
> >> you
> >> > > checked if all offsets are actually being committed by consumers?
> >> > >
> >> > > fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
> >> > > dvsekhvalnov@gmail.com
> >> > > >:
> >> > >
> >> > > > Yeah, probably we can dig around.
> >> > > >
> >> > > > One more observation, the most lag/re-consumption trouble
> happening
> >> > when
> >> > > we
> >> > > > kill broker with lowest id (e.g. 100 from [100,101,102]).
> >> > > > When crashing other brokers - there is nothing special happening,
> >> lag
> >> > > > growing little bit but nothing crazy (e.g. thousands, not
> millions).
> >> > > >
> >> > > > Is it sounds suspicious?
> >> > > >
> >> > > > On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <sc...@gmail.com>
> >> > wrote:
> >> > > >
> >> > > > > Ted: when choosing earliest/latest you are saying: if it happens
> >> that
> >> > > > there
> >> > > > > is no "valid" offset committed for a consumer (for whatever
> >> reason:
> >> > > > > bug/misconfiguration/no luck) it will be ok to start from the
> >> > beginning
> >> > > > or
> >> > > > > end of the topic. So if you are not ok with that you should
> choose
> >> > > none.
> >> > > > >
> >> > > > > Dmitriy: Ok. Then it is spring-kafka that maintains this offset
> >> per
> >> > > > > partition state for you. it might also has that problem of
> leaving
> >> > > stale
> >> > > > > offsets lying around, After quickly looking through
> >> > > > > https://github.com/spring-projects/spring-kafka/blob/
> >> > > > > 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> >> > > > > main/java/org/springframework/kafka/listener/
> >> > > > > KafkaMessageListenerContainer.java
> >> > > > > it looks possible since offsets map is not cleared upon
> partition
> >> > > > > revocation, but that is just a hypothesis. I have no experience
> >> with
> >> > > > > spring-kafka. However since you say you consumers were always
> >> active
> >> > I
> >> > > > find
> >> > > > > this theory worth investigating.
> >> > > > >
> >> > > > >
> >> > > > > 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> >> > > > > vincent.dautremont@olamobile.com.invalid>:
> >> > > > >
> >> > > > > > is there a way to read messages on a topic partition from a
> >> > specific
> >> > > > node
> >> > > > > > we that we choose (and not by the topic partition leader) ?
> >> > > > > > I would like to read myself that each of the
> __consumer_offsets
> >> > > > partition
> >> > > > > > replicas have the same consumer group offset written in it in
> >> it.
> >> > > > > >
> >> > > > > > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> >> > > > > > dvsekhvalnov@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Stas:
> >> > > > > > >
> >> > > > > > > we rely on spring-kafka, it  commits offsets "manually" for
> us
> >> > > after
> >> > > > > > event
> >> > > > > > > handler completed. So it's kind of automatic once there is
> >> > constant
> >> > > > > > stream
> >> > > > > > > of events (no idle time, which is true for us). Though it's
> >> not
> >> > > what
> >> > > > > pure
> >> > > > > > > kafka-client calls "automatic" (flush commits at fixed
> >> > intervals).
> >> > > > > > >
> >> > > > > > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <
> >> schizhov@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > You don't have autocmmit enables that means you commit
> >> offsets
> >> > > > > > yourself -
> >> > > > > > > > correct? If you store them per partition somewhere and
> fail
> >> to
> >> > > > clean
> >> > > > > it
> >> > > > > > > up
> >> > > > > > > > upon rebalance next time the consumer gets this partition
> >> > > assigned
> >> > > > > > during
> >> > > > > > > > next rebalance it can commit old stale offset- can this be
> >> the
> >> > > > case?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> >> > > > > > > > dvsekhvalnov@gmail.com
> >> > > > > > > > >:
> >> > > > > > > >
> >> > > > > > > > > Reprocessing same events again - is fine for us
> >> (idempotent).
> >> > > > While
> >> > > > > > > > loosing
> >> > > > > > > > > data is more critical.
> >> > > > > > > > >
> >> > > > > > > > > What are reasons of such behaviour? Consumers are never
> >> idle,
> >> > > > > always
> >> > > > > > > > > commiting, probably something wrong with broker setup
> >> then?
> >> > > > > > > > >
> >> > > > > > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <
> >> yuzhihong@gmail.com>
> >> > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Stas:
> >> > > > > > > > > > bq.  using anything but none is not really an option
> >> > > > > > > > > >
> >> > > > > > > > > > If you have time, can you explain a bit more ?
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks
> >> > > > > > > > > >
> >> > > > > > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
> >> > > > schizhov@gmail.com
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > If you set auto.offset.reset to none next time it
> >> happens
> >> > > you
> >> > > > > > will
> >> > > > > > > be
> >> > > > > > > > > in
> >> > > > > > > > > > > much better position to find out what happens. Also
> in
> >> > > > general
> >> > > > > > with
> >> > > > > > > > > > current
> >> > > > > > > > > > > semantics of offset reset policy IMO using anything
> >> but
> >> > > none
> >> > > > is
> >> > > > > > not
> >> > > > > > > > > > really
> >> > > > > > > > > > > an option unless it is ok for consumer to loose some
> >> data
> >> > > > > > (latest)
> >> > > > > > > or
> >> > > > > > > > > > > reprocess it second time (earliest).
> >> > > > > > > > > > >
> >> > > > > > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
> >> > > yuzhihong@gmail.com
> >> > > > >:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Should Kafka log warning if log.retention.hours is
> >> > lower
> >> > > > than
> >> > > > > > > > number
> >> > > > > > > > > of
> >> > > > > > > > > > > > hours specified by offsets.retention.minutes ?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> >> > > > > > > > manikumar.reddy@gmail.com
> >> > > > > > > > > >
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > normally, log.retention.hours (168hrs)  should
> be
> >> > > higher
> >> > > > > than
> >> > > > > > > > > > > > > offsets.retention.minutes (336 hrs)?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy
> >> Vsekhvalnov <
> >> > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> >> > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Ted,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Broker: v0.11.0.0
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Consumer:
> >> > > > > > > > > > > > > > kafka-clients v0.11.0.0
> >> > > > > > > > > > > > > > auto.offset.reset = earliest
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> >> > > > > > yuzhihong@gmail.com>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > What's the value for auto.offset.reset  ?
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Which release are you using ?
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Cheers
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
> >> > > Vsekhvalnov <
> >> > > > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> >> > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hi all,
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > we several time faced situation where
> >> > > > consumer-group
> >> > > > > > > > started
> >> > > > > > > > > to
> >> > > > > > > > > > > > > > > re-consume
> >> > > > > > > > > > > > > > > > old events from beginning. Here is
> scenario:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3
> node
> >> > > > > zookeeper
> >> > > > > > > > > > > > > > > > 2. RF=3 for all topics
> >> > > > > > > > > > > > > > > > 3. log.retention.hours=168 and
> >> > > > > > > > > offsets.retention.minutes=20160
> >> > > > > > > > > > > > > > > > 4. running sustainable load (pushing
> events)
> >> > > > > > > > > > > > > > > > 5. doing disaster testing by randomly
> >> shutting
> >> > > > down 1
> >> > > > > > of
> >> > > > > > > 3
> >> > > > > > > > > > broker
> >> > > > > > > > > > > > > nodes
> >> > > > > > > > > > > > > > > > (then provision new broker back)
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Several times after bouncing broker we
> faced
> >> > > > > situation
> >> > > > > > > > where
> >> > > > > > > > > > > > consumer
> >> > > > > > > > > > > > > > > group
> >> > > > > > > > > > > > > > > > started to re-consume old events.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > consumer group:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > 1. enable.auto.commit = false
> >> > > > > > > > > > > > > > > > 2. tried graceful group shutdown, kill -9
> >> and
> >> > > > > > terminating
> >> > > > > > > > AWS
> >> > > > > > > > > > > nodes
> >> > > > > > > > > > > > > > > > 3. never experienced re-consumption for
> >> given
> >> > > > cases.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > What can cause that old events
> >> re-consumption?
> >> > Is
> >> > > > it
> >> > > > > > > > related
> >> > > > > > > > > to
> >> > > > > > > > > > > > > > bouncing
> >> > > > > > > > > > > > > > > > one of brokers? What to search in a logs?
> >> Any
> >> > > > broker
> >> > > > > > > > settings
> >> > > > > > > > > > to
> >> > > > > > > > > > > > try?
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Thanks in advance.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: kafka broker loosing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
Hey, want to resurrect this thread.

Decided to do an idle test, where no data is produced to the topic at all.
When we kill #101 or #102, nothing happens. But when we kill #200,
consumers start to re-consume old events from a random position.

Does anybody have ideas on what to check? I really expected Kafka to fail
symmetrically with respect to any broker.
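
One way to tell whether the committed offsets themselves move backwards
(as opposed to the consumers resetting on their side) is to snapshot them
right before and right after killing the broker. Below is a minimal sketch
against the plain kafka-clients 0.11 API; the topic "events", the group
"event-consumers" and localhost:9092 are placeholders for the real values.

import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetSnapshot {
    public static void main(String[] args) {
        String topic = "events";                              // placeholder
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // placeholder
        props.put("group.id", "event-consumers");             // placeholder
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // No subscribe()/poll() here, so this does not join the group;
        // it only reads the committed offsets and the log end offsets.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor(topic)) {
                partitions.add(new TopicPartition(topic, p.partition()));
            }
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp); // null if never committed
                System.out.printf("%s committed=%s end=%d%n",
                        tp, committed == null ? "none" : committed.offset(), end.get(tp));
            }
        }
    }
}

If the committed offset printed here jumps back after the bounce, the
problem is on the offset-commit/__consumer_offsets side; if it stays put
while the group still rewinds, the reset is happening client-side.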

On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <dv...@gmail.com>
wrote:

> Hi tao,
>
> we had unclean leader election enabled at the beginning. But then disabled
> it and also reduced 'max.poll.records' value.  It helped little bit.
>
> But after today's testing there is strong correlation between lag spike
> and what broker we crash. For lowest ID (100) broker :
>   1. it always at least 1-2 orders higher lag
>   2. we start getting
>
> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be
> completed since the group has already rebalanced and assigned the
> partitions to another member. This means that the time between subsequent
> calls to poll() was longer than the configured max.poll.interval.ms,
> which typically implies that the poll loop is spending too much time
> message processing. You can address this either by increasing the session
> timeout or by reducing the maximum size of batches returned in poll() with
> max.poll.records.
>
>   3. sometime re-consumption from random position
>
> And when we crashing other brokers (101, 102), it just lag spike of ~10Ks
> order, settle down quite quickly, no consumer exceptions.
>
> Totally lost what to try next.
>
> On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xi...@gmail.com> wrote:
>
>> Do you have unclean leader election turned on? If killing 100 is the only
>> way to reproduce the problem, it is possible with unclean leader election
>> turned on that leadership was transferred to out of ISR follower which may
>> not have the latest high watermark
>> On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov <
>> dvsekhvalnov@gmail.com>
>> wrote:
>>
>> > About to verify hypothesis on monday, but looks like that in latest
>> tests.
>> > Need to double check.
>> >
>> > On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com>
>> wrote:
>> >
>> > > So no matter in what sequence you shutdown brokers it is only 1 that
>> > causes
>> > > the major problem? That would indeed be a bit weird. have you checked
>> > > offsets of your consumer - right after offsets jump back - does it
>> start
>> > > from the topic start or does it go back to some random position? Have
>> you
>> > > checked if all offsets are actually being committed by consumers?
>> > >
>> > > fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
>> > > dvsekhvalnov@gmail.com
>> > > >:
>> > >
>> > > > Yeah, probably we can dig around.
>> > > >
>> > > > One more observation, the most lag/re-consumption trouble happening
>> > when
>> > > we
>> > > > kill broker with lowest id (e.g. 100 from [100,101,102]).
>> > > > When crashing other brokers - there is nothing special happening,
>> lag
>> > > > growing little bit but nothing crazy (e.g. thousands, not millions).
>> > > >
>> > > > Is it sounds suspicious?
>> > > >
>> > > > On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <sc...@gmail.com>
>> > wrote:
>> > > >
>> > > > > Ted: when choosing earliest/latest you are saying: if it happens
>> that
>> > > > there
>> > > > > is no "valid" offset committed for a consumer (for whatever
>> reason:
>> > > > > bug/misconfiguration/no luck) it will be ok to start from the
>> > beginning
>> > > > or
>> > > > > end of the topic. So if you are not ok with that you should choose
>> > > none.
>> > > > >
>> > > > > Dmitriy: Ok. Then it is spring-kafka that maintains this offset
>> per
>> > > > > partition state for you. it might also has that problem of leaving
>> > > stale
>> > > > > offsets lying around, After quickly looking through
>> > > > > https://github.com/spring-projects/spring-kafka/blob/
>> > > > > 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
>> > > > > main/java/org/springframework/kafka/listener/
>> > > > > KafkaMessageListenerContainer.java
>> > > > > it looks possible since offsets map is not cleared upon partition
>> > > > > revocation, but that is just a hypothesis. I have no experience
>> with
>> > > > > spring-kafka. However since you say you consumers were always
>> active
>> > I
>> > > > find
>> > > > > this theory worth investigating.
>> > > > >
>> > > > >
>> > > > > 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
>> > > > > vincent.dautremont@olamobile.com.invalid>:
>> > > > >
>> > > > > > is there a way to read messages on a topic partition from a
>> > specific
>> > > > node
>> > > > > > we that we choose (and not by the topic partition leader) ?
>> > > > > > I would like to read myself that each of the __consumer_offsets
>> > > > partition
>> > > > > > replicas have the same consumer group offset written in it in
>> it.
>> > > > > >
>> > > > > > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
>> > > > > > dvsekhvalnov@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Stas:
>> > > > > > >
>> > > > > > > we rely on spring-kafka, it  commits offsets "manually" for us
>> > > after
>> > > > > > event
>> > > > > > > handler completed. So it's kind of automatic once there is
>> > constant
>> > > > > > stream
>> > > > > > > of events (no idle time, which is true for us). Though it's
>> not
>> > > what
>> > > > > pure
>> > > > > > > kafka-client calls "automatic" (flush commits at fixed
>> > intervals).
>> > > > > > >
>> > > > > > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <
>> schizhov@gmail.com
>> > >
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > You don't have autocmmit enables that means you commit
>> offsets
>> > > > > > yourself -
>> > > > > > > > correct? If you store them per partition somewhere and fail
>> to
>> > > > clean
>> > > > > it
>> > > > > > > up
>> > > > > > > > upon rebalance next time the consumer gets this partition
>> > > assigned
>> > > > > > during
>> > > > > > > > next rebalance it can commit old stale offset- can this be
>> the
>> > > > case?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
>> > > > > > > > dvsekhvalnov@gmail.com
>> > > > > > > > >:
>> > > > > > > >
>> > > > > > > > > Reprocessing same events again - is fine for us
>> (idempotent).
>> > > > While
>> > > > > > > > loosing
>> > > > > > > > > data is more critical.
>> > > > > > > > >
>> > > > > > > > > What are reasons of such behaviour? Consumers are never
>> idle,
>> > > > > always
>> > > > > > > > > commiting, probably something wrong with broker setup
>> then?
>> > > > > > > > >
>> > > > > > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <
>> yuzhihong@gmail.com>
>> > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Stas:
>> > > > > > > > > > bq.  using anything but none is not really an option
>> > > > > > > > > >
>> > > > > > > > > > If you have time, can you explain a bit more ?
>> > > > > > > > > >
>> > > > > > > > > > Thanks
>> > > > > > > > > >
>> > > > > > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
>> > > > schizhov@gmail.com
>> > > > > >
>> > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > If you set auto.offset.reset to none next time it
>> happens
>> > > you
>> > > > > > will
>> > > > > > > be
>> > > > > > > > > in
>> > > > > > > > > > > much better position to find out what happens. Also in
>> > > > general
>> > > > > > with
>> > > > > > > > > > current
>> > > > > > > > > > > semantics of offset reset policy IMO using anything
>> but
>> > > none
>> > > > is
>> > > > > > not
>> > > > > > > > > > really
>> > > > > > > > > > > an option unless it is ok for consumer to loose some
>> data
>> > > > > > (latest)
>> > > > > > > or
>> > > > > > > > > > > reprocess it second time (earliest).
>> > > > > > > > > > >
>> > > > > > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
>> > > yuzhihong@gmail.com
>> > > > >:
>> > > > > > > > > > >
>> > > > > > > > > > > > Should Kafka log warning if log.retention.hours is
>> > lower
>> > > > than
>> > > > > > > > number
>> > > > > > > > > of
>> > > > > > > > > > > > hours specified by offsets.retention.minutes ?
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
>> > > > > > > > manikumar.reddy@gmail.com
>> > > > > > > > > >
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > normally, log.retention.hours (168hrs)  should be
>> > > higher
>> > > > > than
>> > > > > > > > > > > > > offsets.retention.minutes (336 hrs)?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy
>> Vsekhvalnov <
>> > > > > > > > > > > > > dvsekhvalnov@gmail.com>
>> > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Ted,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Broker: v0.11.0.0
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Consumer:
>> > > > > > > > > > > > > > kafka-clients v0.11.0.0
>> > > > > > > > > > > > > > auto.offset.reset = earliest
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
>> > > > > > yuzhihong@gmail.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > What's the value for auto.offset.reset  ?
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Which release are you using ?
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Cheers
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
>> > > Vsekhvalnov <
>> > > > > > > > > > > > > > > dvsekhvalnov@gmail.com>
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hi all,
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > we several time faced situation where
>> > > > consumer-group
>> > > > > > > > started
>> > > > > > > > > to
>> > > > > > > > > > > > > > > re-consume
>> > > > > > > > > > > > > > > > old events from beginning. Here is scenario:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node
>> > > > > zookeeper
>> > > > > > > > > > > > > > > > 2. RF=3 for all topics
>> > > > > > > > > > > > > > > > 3. log.retention.hours=168 and
>> > > > > > > > > offsets.retention.minutes=20160
>> > > > > > > > > > > > > > > > 4. running sustainable load (pushing events)
>> > > > > > > > > > > > > > > > 5. doing disaster testing by randomly
>> shutting
>> > > > down 1
>> > > > > > of
>> > > > > > > 3
>> > > > > > > > > > broker
>> > > > > > > > > > > > > nodes
>> > > > > > > > > > > > > > > > (then provision new broker back)
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Several times after bouncing broker we faced
>> > > > > situation
>> > > > > > > > where
>> > > > > > > > > > > > consumer
>> > > > > > > > > > > > > > > group
>> > > > > > > > > > > > > > > > started to re-consume old events.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > consumer group:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > 1. enable.auto.commit = false
>> > > > > > > > > > > > > > > > 2. tried graceful group shutdown, kill -9
>> and
>> > > > > > terminating
>> > > > > > > > AWS
>> > > > > > > > > > > nodes
>> > > > > > > > > > > > > > > > 3. never experienced re-consumption for
>> given
>> > > > cases.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > What can cause that old events
>> re-consumption?
>> > Is
>> > > > it
>> > > > > > > > related
>> > > > > > > > > to
>> > > > > > > > > > > > > > bouncing
>> > > > > > > > > > > > > > > > one of brokers? What to search in a logs?
>> Any
>> > > > broker
>> > > > > > > > settings
>> > > > > > > > > > to
>> > > > > > > > > > > > try?
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Thanks in advance.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: kafka broker loosing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
Hi tao,

we had unclean leader election enabled at the beginning, but then disabled
it and also reduced the 'max.poll.records' value. It helped a little bit.

But after today's testing there is a strong correlation between the lag
spike and which broker we crash. For the lowest-ID broker (100):
  1. the lag is always at least 1-2 orders of magnitude higher
  2. we start getting

org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be
completed since the group has already rebalanced and assigned the
partitions to another member. This means that the time between subsequent
calls to poll() was longer than the configured max.poll.interval.ms, which
typically implies that the poll loop is spending too much time message
processing. You can address this either by increasing the session timeout
or by reducing the maximum size of batches returned in poll() with
max.poll.records.

  3. sometimes re-consumption from a random position

And when we crash the other brokers (101, 102), there is just a lag spike
on the order of ~10K that settles down quite quickly, with no consumer
exceptions.

Totally lost as to what to try next.
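
For what it is worth, the CommitFailedException above points at two
consumer-side knobs. The values below are only illustrative; the right
numbers depend on how long one batch of records takes to process in the
listener.

# fewer records handed to the listener per poll()
max.poll.records=50
# upper bound on the time between poll() calls before the group evicts the member
max.poll.interval.ms=600000
# heartbeat-based liveness check, independent of processing time since 0.10.1
session.timeout.ms=30000

That only addresses the rebalance storm, though; it would not by itself
explain committed offsets moving backwards.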

On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xi...@gmail.com> wrote:

> Do you have unclean leader election turned on? If killing 100 is the only
> way to reproduce the problem, it is possible with unclean leader election
> turned on that leadership was transferred to out of ISR follower which may
> not have the latest high watermark
> On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov <dvsekhvalnov@gmail.com
> >
> wrote:
>
> > About to verify hypothesis on monday, but looks like that in latest
> tests.
> > Need to double check.
> >
> > On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com>
> wrote:
> >
> > > So no matter in what sequence you shutdown brokers it is only 1 that
> > causes
> > > the major problem? That would indeed be a bit weird. have you checked
> > > offsets of your consumer - right after offsets jump back - does it
> start
> > > from the topic start or does it go back to some random position? Have
> you
> > > checked if all offsets are actually being committed by consumers?
> > >
> > > fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
> > > dvsekhvalnov@gmail.com
> > > >:
> > >
> > > > Yeah, probably we can dig around.
> > > >
> > > > One more observation, the most lag/re-consumption trouble happening
> > when
> > > we
> > > > kill broker with lowest id (e.g. 100 from [100,101,102]).
> > > > When crashing other brokers - there is nothing special happening, lag
> > > > growing little bit but nothing crazy (e.g. thousands, not millions).
> > > >
> > > > Is it sounds suspicious?
> > > >
> > > > On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <sc...@gmail.com>
> > wrote:
> > > >
> > > > > Ted: when choosing earliest/latest you are saying: if it happens
> that
> > > > there
> > > > > is no "valid" offset committed for a consumer (for whatever reason:
> > > > > bug/misconfiguration/no luck) it will be ok to start from the
> > beginning
> > > > or
> > > > > end of the topic. So if you are not ok with that you should choose
> > > none.
> > > > >
> > > > > Dmitriy: Ok. Then it is spring-kafka that maintains this offset per
> > > > > partition state for you. it might also has that problem of leaving
> > > stale
> > > > > offsets lying around, After quickly looking through
> > > > > https://github.com/spring-projects/spring-kafka/blob/
> > > > > 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> > > > > main/java/org/springframework/kafka/listener/
> > > > > KafkaMessageListenerContainer.java
> > > > > it looks possible since offsets map is not cleared upon partition
> > > > > revocation, but that is just a hypothesis. I have no experience
> with
> > > > > spring-kafka. However since you say you consumers were always
> active
> > I
> > > > find
> > > > > this theory worth investigating.
> > > > >
> > > > >
> > > > > 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> > > > > vincent.dautremont@olamobile.com.invalid>:
> > > > >
> > > > > > is there a way to read messages on a topic partition from a
> > specific
> > > > node
> > > > > > we that we choose (and not by the topic partition leader) ?
> > > > > > I would like to read myself that each of the __consumer_offsets
> > > > partition
> > > > > > replicas have the same consumer group offset written in it in it.
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> > > > > > dvsekhvalnov@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Stas:
> > > > > > >
> > > > > > > we rely on spring-kafka, it  commits offsets "manually" for us
> > > after
> > > > > > event
> > > > > > > handler completed. So it's kind of automatic once there is
> > constant
> > > > > > stream
> > > > > > > of events (no idle time, which is true for us). Though it's not
> > > what
> > > > > pure
> > > > > > > kafka-client calls "automatic" (flush commits at fixed
> > intervals).
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <
> schizhov@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > You don't have autocmmit enables that means you commit
> offsets
> > > > > > yourself -
> > > > > > > > correct? If you store them per partition somewhere and fail
> to
> > > > clean
> > > > > it
> > > > > > > up
> > > > > > > > upon rebalance next time the consumer gets this partition
> > > assigned
> > > > > > during
> > > > > > > > next rebalance it can commit old stale offset- can this be
> the
> > > > case?
> > > > > > > >
> > > > > > > >
> > > > > > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > > > > > > > dvsekhvalnov@gmail.com
> > > > > > > > >:
> > > > > > > >
> > > > > > > > > Reprocessing same events again - is fine for us
> (idempotent).
> > > > While
> > > > > > > > loosing
> > > > > > > > > data is more critical.
> > > > > > > > >
> > > > > > > > > What are reasons of such behaviour? Consumers are never
> idle,
> > > > > always
> > > > > > > > > commiting, probably something wrong with broker setup then?
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <
> yuzhihong@gmail.com>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Stas:
> > > > > > > > > > bq.  using anything but none is not really an option
> > > > > > > > > >
> > > > > > > > > > If you have time, can you explain a bit more ?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
> > > > schizhov@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > If you set auto.offset.reset to none next time it
> happens
> > > you
> > > > > > will
> > > > > > > be
> > > > > > > > > in
> > > > > > > > > > > much better position to find out what happens. Also in
> > > > general
> > > > > > with
> > > > > > > > > > current
> > > > > > > > > > > semantics of offset reset policy IMO using anything but
> > > none
> > > > is
> > > > > > not
> > > > > > > > > > really
> > > > > > > > > > > an option unless it is ok for consumer to loose some
> data
> > > > > > (latest)
> > > > > > > or
> > > > > > > > > > > reprocess it second time (earliest).
> > > > > > > > > > >
> > > > > > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
> > > yuzhihong@gmail.com
> > > > >:
> > > > > > > > > > >
> > > > > > > > > > > > Should Kafka log warning if log.retention.hours is
> > lower
> > > > than
> > > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > > hours specified by offsets.retention.minutes ?
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > > > > > > > manikumar.reddy@gmail.com
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > normally, log.retention.hours (168hrs)  should be
> > > higher
> > > > > than
> > > > > > > > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy
> Vsekhvalnov <
> > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Ted,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Broker: v0.11.0.0
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Consumer:
> > > > > > > > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > > > > > > > auto.offset.reset = earliest
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> > > > > > yuzhihong@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Which release are you using ?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
> > > Vsekhvalnov <
> > > > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > we several time faced situation where
> > > > consumer-group
> > > > > > > > started
> > > > > > > > > to
> > > > > > > > > > > > > > > re-consume
> > > > > > > > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node
> > > > > zookeeper
> > > > > > > > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > > > > > > > 3. log.retention.hours=168 and
> > > > > > > > > offsets.retention.minutes=20160
> > > > > > > > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > > > > > > > 5. doing disaster testing by randomly
> shutting
> > > > down 1
> > > > > > of
> > > > > > > 3
> > > > > > > > > > broker
> > > > > > > > > > > > > nodes
> > > > > > > > > > > > > > > > (then provision new broker back)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Several times after bouncing broker we faced
> > > > > situation
> > > > > > > > where
> > > > > > > > > > > > consumer
> > > > > > > > > > > > > > > group
> > > > > > > > > > > > > > > > started to re-consume old events.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > consumer group:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > > > > > > > 2. tried graceful group shutdown, kill -9 and
> > > > > > terminating
> > > > > > > > AWS
> > > > > > > > > > > nodes
> > > > > > > > > > > > > > > > 3. never experienced re-consumption for given
> > > > cases.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > What can cause that old events
> re-consumption?
> > Is
> > > > it
> > > > > > > > related
> > > > > > > > > to
> > > > > > > > > > > > > > bouncing
> > > > > > > > > > > > > > > > one of brokers? What to search in a logs? Any
> > > > broker
> > > > > > > > settings
> > > > > > > > > > to
> > > > > > > > > > > > try?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks in advance.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: kafka broker loosing offsets?

Posted by tao xiao <xi...@gmail.com>.
Do you have unclean leader election turned on? If killing 100 is the only
way to reproduce the problem, it is possible, with unclean leader election
turned on, that leadership was transferred to an out-of-ISR follower which
may not have the latest high watermark.
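
In case it helps to rule that out, unclean election can be pinned down
from configuration alone. A sketch, assuming ZooKeeper on localhost:2181
and a topic named "events" (both placeholders); note that the topic-level
setting overrides the broker default:

# broker-wide default in server.properties (false by default on 0.11)
unclean.leader.election.enable=false

# inspect / set the per-topic override
bin/kafka-configs.sh --zookeeper localhost:2181 --describe \
  --entity-type topics --entity-name events
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name events \
  --add-config unclean.leader.election.enable=false

Checking the __consumer_offsets topic the same way is probably the more
interesting exercise here, since an unclean election on one of its
partitions would be one way for committed offsets to go backwards.
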
On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov <dv...@gmail.com>
wrote:

> About to verify hypothesis on monday, but looks like that in latest tests.
> Need to double check.
>
> On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com> wrote:
>
> > So no matter in what sequence you shutdown brokers it is only 1 that
> causes
> > the major problem? That would indeed be a bit weird. have you checked
> > offsets of your consumer - right after offsets jump back - does it start
> > from the topic start or does it go back to some random position? Have you
> > checked if all offsets are actually being committed by consumers?
> >
> > fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
> > dvsekhvalnov@gmail.com
> > >:
> >
> > > Yeah, probably we can dig around.
> > >
> > > One more observation, the most lag/re-consumption trouble happening
> when
> > we
> > > kill broker with lowest id (e.g. 100 from [100,101,102]).
> > > When crashing other brokers - there is nothing special happening, lag
> > > growing little bit but nothing crazy (e.g. thousands, not millions).
> > >
> > > Is it sounds suspicious?
> > >
> > > On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <sc...@gmail.com>
> wrote:
> > >
> > > > Ted: when choosing earliest/latest you are saying: if it happens that
> > > there
> > > > is no "valid" offset committed for a consumer (for whatever reason:
> > > > bug/misconfiguration/no luck) it will be ok to start from the
> beginning
> > > or
> > > > end of the topic. So if you are not ok with that you should choose
> > none.
> > > >
> > > > Dmitriy: Ok. Then it is spring-kafka that maintains this offset per
> > > > partition state for you. it might also has that problem of leaving
> > stale
> > > > offsets lying around, After quickly looking through
> > > > https://github.com/spring-projects/spring-kafka/blob/
> > > > 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> > > > main/java/org/springframework/kafka/listener/
> > > > KafkaMessageListenerContainer.java
> > > > it looks possible since offsets map is not cleared upon partition
> > > > revocation, but that is just a hypothesis. I have no experience with
> > > > spring-kafka. However since you say you consumers were always active
> I
> > > find
> > > > this theory worth investigating.
> > > >
> > > >
> > > > 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> > > > vincent.dautremont@olamobile.com.invalid>:
> > > >
> > > > > is there a way to read messages on a topic partition from a
> specific
> > > node
> > > > > we that we choose (and not by the topic partition leader) ?
> > > > > I would like to read myself that each of the __consumer_offsets
> > > partition
> > > > > replicas have the same consumer group offset written in it in it.
> > > > >
> > > > > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> > > > > dvsekhvalnov@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Stas:
> > > > > >
> > > > > > we rely on spring-kafka, it  commits offsets "manually" for us
> > after
> > > > > event
> > > > > > handler completed. So it's kind of automatic once there is
> constant
> > > > > stream
> > > > > > of events (no idle time, which is true for us). Though it's not
> > what
> > > > pure
> > > > > > kafka-client calls "automatic" (flush commits at fixed
> intervals).
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <schizhov@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > You don't have autocmmit enables that means you commit offsets
> > > > > yourself -
> > > > > > > correct? If you store them per partition somewhere and fail to
> > > clean
> > > > it
> > > > > > up
> > > > > > > upon rebalance next time the consumer gets this partition
> > assigned
> > > > > during
> > > > > > > next rebalance it can commit old stale offset- can this be the
> > > case?
> > > > > > >
> > > > > > >
> > > > > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > > > > > > dvsekhvalnov@gmail.com
> > > > > > > >:
> > > > > > >
> > > > > > > > Reprocessing same events again - is fine for us (idempotent).
> > > While
> > > > > > > loosing
> > > > > > > > data is more critical.
> > > > > > > >
> > > > > > > > What are reasons of such behaviour? Consumers are never idle,
> > > > always
> > > > > > > > commiting, probably something wrong with broker setup then?
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com>
> > > > wrote:
> > > > > > > >
> > > > > > > > > Stas:
> > > > > > > > > bq.  using anything but none is not really an option
> > > > > > > > >
> > > > > > > > > If you have time, can you explain a bit more ?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
> > > schizhov@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > If you set auto.offset.reset to none next time it happens
> > you
> > > > > will
> > > > > > be
> > > > > > > > in
> > > > > > > > > > much better position to find out what happens. Also in
> > > general
> > > > > with
> > > > > > > > > current
> > > > > > > > > > semantics of offset reset policy IMO using anything but
> > none
> > > is
> > > > > not
> > > > > > > > > really
> > > > > > > > > > an option unless it is ok for consumer to loose some data
> > > > > (latest)
> > > > > > or
> > > > > > > > > > reprocess it second time (earliest).
> > > > > > > > > >
> > > > > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
> > yuzhihong@gmail.com
> > > >:
> > > > > > > > > >
> > > > > > > > > > > Should Kafka log warning if log.retention.hours is
> lower
> > > than
> > > > > > > number
> > > > > > > > of
> > > > > > > > > > > hours specified by offsets.retention.minutes ?
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > > > > > > manikumar.reddy@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > normally, log.retention.hours (168hrs)  should be
> > higher
> > > > than
> > > > > > > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Ted,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Broker: v0.11.0.0
> > > > > > > > > > > > >
> > > > > > > > > > > > > Consumer:
> > > > > > > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > > > > > > auto.offset.reset = earliest
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> > > > > yuzhihong@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Which release are you using ?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
> > Vsekhvalnov <
> > > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > we several time faced situation where
> > > consumer-group
> > > > > > > started
> > > > > > > > to
> > > > > > > > > > > > > > re-consume
> > > > > > > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node
> > > > zookeeper
> > > > > > > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > > > > > > 3. log.retention.hours=168 and
> > > > > > > > offsets.retention.minutes=20160
> > > > > > > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > > > > > > 5. doing disaster testing by randomly shutting
> > > down 1
> > > > > of
> > > > > > 3
> > > > > > > > > broker
> > > > > > > > > > > > nodes
> > > > > > > > > > > > > > > (then provision new broker back)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Several times after bouncing broker we faced
> > > > situation
> > > > > > > where
> > > > > > > > > > > consumer
> > > > > > > > > > > > > > group
> > > > > > > > > > > > > > > started to re-consume old events.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > consumer group:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > > > > > > 2. tried graceful group shutdown, kill -9 and
> > > > > terminating
> > > > > > > AWS
> > > > > > > > > > nodes
> > > > > > > > > > > > > > > 3. never experienced re-consumption for given
> > > cases.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > What can cause that old events re-consumption?
> Is
> > > it
> > > > > > > related
> > > > > > > > to
> > > > > > > > > > > > > bouncing
> > > > > > > > > > > > > > > one of brokers? What to search in a logs? Any
> > > broker
> > > > > > > settings
> > > > > > > > > to
> > > > > > > > > > > try?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks in advance.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > The information transmitted is intended only for the person or
> entity
> > > to
> > > > > which it is addressed and may contain confidential and/or
> privileged
> > > > > material. Any review, retransmission, dissemination or other use
> of,
> > or
> > > > > taking of any action in reliance upon, this information by persons
> or
> > > > > entities other than the intended recipient is prohibited. If you
> > > received
> > > > > this in error, please contact the sender and delete the material
> from
> > > any
> > > > > computer.
> > > > >
> > > >
> > >
> >
>

Re: kafka broker loosing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
About to verify the hypothesis on Monday, but it looks that way in the latest tests.
Need to double-check.

On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <sc...@gmail.com> wrote:

> So no matter in what sequence you shutdown brokers it is only 1 that causes
> the major problem? That would indeed be a bit weird. have you checked
> offsets of your consumer - right after offsets jump back - does it start
> from the topic start or does it go back to some random position? Have you
> checked if all offsets are actually being committed by consumers?
>
> fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com
> >:
>
> > Yeah, probably we can dig around.
> >
> > One more observation, the most lag/re-consumption trouble happening when
> we
> > kill broker with lowest id (e.g. 100 from [100,101,102]).
> > When crashing other brokers - there is nothing special happening, lag
> > growing little bit but nothing crazy (e.g. thousands, not millions).
> >
> > Is it sounds suspicious?
> >
> > On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <sc...@gmail.com> wrote:
> >
> > > Ted: when choosing earliest/latest you are saying: if it happens that
> > there
> > > is no "valid" offset committed for a consumer (for whatever reason:
> > > bug/misconfiguration/no luck) it will be ok to start from the beginning
> > or
> > > end of the topic. So if you are not ok with that you should choose
> none.
> > >
> > > Dmitriy: Ok. Then it is spring-kafka that maintains this offset per
> > > partition state for you. it might also has that problem of leaving
> stale
> > > offsets lying around, After quickly looking through
> > > https://github.com/spring-projects/spring-kafka/blob/
> > > 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> > > main/java/org/springframework/kafka/listener/
> > > KafkaMessageListenerContainer.java
> > > it looks possible since offsets map is not cleared upon partition
> > > revocation, but that is just a hypothesis. I have no experience with
> > > spring-kafka. However since you say you consumers were always active I
> > find
> > > this theory worth investigating.
> > >
> > >
> > > 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> > > vincent.dautremont@olamobile.com.invalid>:
> > >
> > > > is there a way to read messages on a topic partition from a specific
> > node
> > > > we that we choose (and not by the topic partition leader) ?
> > > > I would like to read myself that each of the __consumer_offsets
> > partition
> > > > replicas have the same consumer group offset written in it in it.
> > > >
> > > > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> > > > dvsekhvalnov@gmail.com>
> > > > wrote:
> > > >
> > > > > Stas:
> > > > >
> > > > > we rely on spring-kafka, it  commits offsets "manually" for us
> after
> > > > event
> > > > > handler completed. So it's kind of automatic once there is constant
> > > > stream
> > > > > of events (no idle time, which is true for us). Though it's not
> what
> > > pure
> > > > > kafka-client calls "automatic" (flush commits at fixed intervals).
> > > > >
> > > > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <sc...@gmail.com>
> > > wrote:
> > > > >
> > > > > > You don't have autocmmit enables that means you commit offsets
> > > > yourself -
> > > > > > correct? If you store them per partition somewhere and fail to
> > clean
> > > it
> > > > > up
> > > > > > upon rebalance next time the consumer gets this partition
> assigned
> > > > during
> > > > > > next rebalance it can commit old stale offset- can this be the
> > case?
> > > > > >
> > > > > >
> > > > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > > > > > dvsekhvalnov@gmail.com
> > > > > > >:
> > > > > >
> > > > > > > Reprocessing same events again - is fine for us (idempotent).
> > While
> > > > > > loosing
> > > > > > > data is more critical.
> > > > > > >
> > > > > > > What are reasons of such behaviour? Consumers are never idle,
> > > always
> > > > > > > commiting, probably something wrong with broker setup then?
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Stas:
> > > > > > > > bq.  using anything but none is not really an option
> > > > > > > >
> > > > > > > > If you have time, can you explain a bit more ?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
> > schizhov@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > If you set auto.offset.reset to none next time it happens
> you
> > > > will
> > > > > be
> > > > > > > in
> > > > > > > > > much better position to find out what happens. Also in
> > general
> > > > with
> > > > > > > > current
> > > > > > > > > semantics of offset reset policy IMO using anything but
> none
> > is
> > > > not
> > > > > > > > really
> > > > > > > > > an option unless it is ok for consumer to loose some data
> > > > (latest)
> > > > > or
> > > > > > > > > reprocess it second time (earliest).
> > > > > > > > >
> > > > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <
> yuzhihong@gmail.com
> > >:
> > > > > > > > >
> > > > > > > > > > Should Kafka log warning if log.retention.hours is lower
> > than
> > > > > > number
> > > > > > > of
> > > > > > > > > > hours specified by offsets.retention.minutes ?
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > > > > > manikumar.reddy@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > normally, log.retention.hours (168hrs)  should be
> higher
> > > than
> > > > > > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Ted,
> > > > > > > > > > > >
> > > > > > > > > > > > Broker: v0.11.0.0
> > > > > > > > > > > >
> > > > > > > > > > > > Consumer:
> > > > > > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > > > > > auto.offset.reset = earliest
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> > > > yuzhihong@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Which release are you using ?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy
> Vsekhvalnov <
> > > > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > we several time faced situation where
> > consumer-group
> > > > > > started
> > > > > > > to
> > > > > > > > > > > > > re-consume
> > > > > > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node
> > > zookeeper
> > > > > > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > > > > > 3. log.retention.hours=168 and
> > > > > > > offsets.retention.minutes=20160
> > > > > > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > > > > > 5. doing disaster testing by randomly shutting
> > down 1
> > > > of
> > > > > 3
> > > > > > > > broker
> > > > > > > > > > > nodes
> > > > > > > > > > > > > > (then provision new broker back)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Several times after bouncing broker we faced
> > > situation
> > > > > > where
> > > > > > > > > > consumer
> > > > > > > > > > > > > group
> > > > > > > > > > > > > > started to re-consume old events.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > consumer group:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > > > > > 2. tried graceful group shutdown, kill -9 and
> > > > terminating
> > > > > > AWS
> > > > > > > > > nodes
> > > > > > > > > > > > > > 3. never experienced re-consumption for given
> > cases.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > What can cause that old events re-consumption? Is
> > it
> > > > > > related
> > > > > > > to
> > > > > > > > > > > > bouncing
> > > > > > > > > > > > > > one of brokers? What to search in a logs? Any
> > broker
> > > > > > settings
> > > > > > > > to
> > > > > > > > > > try?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks in advance.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > > --
> > > > The information transmitted is intended only for the person or entity
> > to
> > > > which it is addressed and may contain confidential and/or privileged
> > > > material. Any review, retransmission, dissemination or other use of,
> or
> > > > taking of any action in reliance upon, this information by persons or
> > > > entities other than the intended recipient is prohibited. If you
> > received
> > > > this in error, please contact the sender and delete the material from
> > any
> > > > computer.
> > > >
> > >
> >
>

Re: kafka broker loosing offsets?

Posted by Stas Chizhov <sc...@gmail.com>.
So no matter in what sequence you shut down brokers, it is only one of them that
causes the major problem? That would indeed be a bit weird. Have you checked the
offsets of your consumer right after they jump back - does it start from the
beginning of the topic, or does it go back to some random position? Have you
checked whether all offsets are actually being committed by the consumers?
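
For example, something along these lines with the plain Java consumer would show
the committed offset next to the partition's current range (bootstrap server,
group id and topic/partition are placeholders; as far as I know committed() just
fetches the offset and does not join the group):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CheckCommittedOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");       // placeholder
        props.put("group.id", "my-consumer-group");           // placeholder: the affected group
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0);   // placeholder topic/partition
            OffsetAndMetadata committed = consumer.committed(tp);    // null if nothing was committed
            long earliest = consumer.beginningOffsets(Collections.singleton(tp)).get(tp);
            long latest = consumer.endOffsets(Collections.singleton(tp)).get(tp);
            System.out.printf("committed=%s earliest=%d latest=%d%n", committed, earliest, latest);
        }
    }
}

Running it right after the offsets jump back should show whether the committed
offset itself went backwards or whether only the consumer position did.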

fre 6 okt. 2017 kl. 20:59 skrev Dmitriy Vsekhvalnov <dvsekhvalnov@gmail.com
>:

> Yeah, probably we can dig around.
>
> One more observation, the most lag/re-consumption trouble happening when we
> kill broker with lowest id (e.g. 100 from [100,101,102]).
> When crashing other brokers - there is nothing special happening, lag
> growing little bit but nothing crazy (e.g. thousands, not millions).
>
> Is it sounds suspicious?
>
> On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <sc...@gmail.com> wrote:
>
> > Ted: when choosing earliest/latest you are saying: if it happens that
> there
> > is no "valid" offset committed for a consumer (for whatever reason:
> > bug/misconfiguration/no luck) it will be ok to start from the beginning
> or
> > end of the topic. So if you are not ok with that you should choose none.
> >
> > Dmitriy: Ok. Then it is spring-kafka that maintains this offset per
> > partition state for you. it might also has that problem of leaving stale
> > offsets lying around, After quickly looking through
> > https://github.com/spring-projects/spring-kafka/blob/
> > 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> > main/java/org/springframework/kafka/listener/
> > KafkaMessageListenerContainer.java
> > it looks possible since offsets map is not cleared upon partition
> > revocation, but that is just a hypothesis. I have no experience with
> > spring-kafka. However since you say you consumers were always active I
> find
> > this theory worth investigating.
> >
> >
> > 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> > vincent.dautremont@olamobile.com.invalid>:
> >
> > > is there a way to read messages on a topic partition from a specific
> node
> > > we that we choose (and not by the topic partition leader) ?
> > > I would like to read myself that each of the __consumer_offsets
> partition
> > > replicas have the same consumer group offset written in it in it.
> > >
> > > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> > > dvsekhvalnov@gmail.com>
> > > wrote:
> > >
> > > > Stas:
> > > >
> > > > we rely on spring-kafka, it  commits offsets "manually" for us after
> > > event
> > > > handler completed. So it's kind of automatic once there is constant
> > > stream
> > > > of events (no idle time, which is true for us). Though it's not what
> > pure
> > > > kafka-client calls "automatic" (flush commits at fixed intervals).
> > > >
> > > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <sc...@gmail.com>
> > wrote:
> > > >
> > > > > You don't have autocmmit enables that means you commit offsets
> > > yourself -
> > > > > correct? If you store them per partition somewhere and fail to
> clean
> > it
> > > > up
> > > > > upon rebalance next time the consumer gets this partition assigned
> > > during
> > > > > next rebalance it can commit old stale offset- can this be the
> case?
> > > > >
> > > > >
> > > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > > > > dvsekhvalnov@gmail.com
> > > > > >:
> > > > >
> > > > > > Reprocessing same events again - is fine for us (idempotent).
> While
> > > > > loosing
> > > > > > data is more critical.
> > > > > >
> > > > > > What are reasons of such behaviour? Consumers are never idle,
> > always
> > > > > > commiting, probably something wrong with broker setup then?
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Stas:
> > > > > > > bq.  using anything but none is not really an option
> > > > > > >
> > > > > > > If you have time, can you explain a bit more ?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <
> schizhov@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > If you set auto.offset.reset to none next time it happens you
> > > will
> > > > be
> > > > > > in
> > > > > > > > much better position to find out what happens. Also in
> general
> > > with
> > > > > > > current
> > > > > > > > semantics of offset reset policy IMO using anything but none
> is
> > > not
> > > > > > > really
> > > > > > > > an option unless it is ok for consumer to loose some data
> > > (latest)
> > > > or
> > > > > > > > reprocess it second time (earliest).
> > > > > > > >
> > > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yuzhihong@gmail.com
> >:
> > > > > > > >
> > > > > > > > > Should Kafka log warning if log.retention.hours is lower
> than
> > > > > number
> > > > > > of
> > > > > > > > > hours specified by offsets.retention.minutes ?
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > > > > manikumar.reddy@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > normally, log.retention.hours (168hrs)  should be higher
> > than
> > > > > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Ted,
> > > > > > > > > > >
> > > > > > > > > > > Broker: v0.11.0.0
> > > > > > > > > > >
> > > > > > > > > > > Consumer:
> > > > > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > > > > auto.offset.reset = earliest
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> > > yuzhihong@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > > > > >
> > > > > > > > > > > > Which release are you using ?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > we several time faced situation where
> consumer-group
> > > > > started
> > > > > > to
> > > > > > > > > > > > re-consume
> > > > > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node
> > zookeeper
> > > > > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > > > > 3. log.retention.hours=168 and
> > > > > > offsets.retention.minutes=20160
> > > > > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > > > > 5. doing disaster testing by randomly shutting
> down 1
> > > of
> > > > 3
> > > > > > > broker
> > > > > > > > > > nodes
> > > > > > > > > > > > > (then provision new broker back)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Several times after bouncing broker we faced
> > situation
> > > > > where
> > > > > > > > > consumer
> > > > > > > > > > > > group
> > > > > > > > > > > > > started to re-consume old events.
> > > > > > > > > > > > >
> > > > > > > > > > > > > consumer group:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > > > > 2. tried graceful group shutdown, kill -9 and
> > > terminating
> > > > > AWS
> > > > > > > > nodes
> > > > > > > > > > > > > 3. never experienced re-consumption for given
> cases.
> > > > > > > > > > > > >
> > > > > > > > > > > > > What can cause that old events re-consumption? Is
> it
> > > > > related
> > > > > > to
> > > > > > > > > > > bouncing
> > > > > > > > > > > > > one of brokers? What to search in a logs? Any
> broker
> > > > > settings
> > > > > > > to
> > > > > > > > > try?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks in advance.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > > --
> > > The information transmitted is intended only for the person or entity
> to
> > > which it is addressed and may contain confidential and/or privileged
> > > material. Any review, retransmission, dissemination or other use of, or
> > > taking of any action in reliance upon, this information by persons or
> > > entities other than the intended recipient is prohibited. If you
> received
> > > this in error, please contact the sender and delete the material from
> any
> > > computer.
> > >
> >
>

Re: kafka broker loosing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
Yeah, probably we can dig around.

One more observation: the most lag/re-consumption trouble happens when we
kill the broker with the lowest id (e.g. 100 from [100,101,102]).
When crashing other brokers there is nothing special happening; lag
grows a little bit but nothing crazy (e.g. thousands, not millions).

Does that sound suspicious?
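
One thing I can try to correlate: as far as I understand, all offsets of a group
live in a single partition of __consumer_offsets, picked roughly as
abs(groupId.hashCode()) % offsets.topic.num.partitions (50 by default). A tiny
sketch to compute it (group id is a placeholder), so we can check whether the
broker we killed was the leader of exactly that partition:

public class GroupOffsetsPartition {
    public static void main(String[] args) {
        String groupId = "my-consumer-group";   // placeholder: the affected group
        int offsetsTopicPartitions = 50;        // default offsets.topic.num.partitions
        int hash = groupId.hashCode();
        // Mirrors (as far as I can tell) the broker's abs(groupId.hashCode) % partition count.
        int partition = (hash == Integer.MIN_VALUE ? 0 : Math.abs(hash)) % offsetsTopicPartitions;
        System.out.println("group '" + groupId + "' maps to __consumer_offsets-" + partition);
    }
}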

On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <sc...@gmail.com> wrote:

> Ted: when choosing earliest/latest you are saying: if it happens that there
> is no "valid" offset committed for a consumer (for whatever reason:
> bug/misconfiguration/no luck) it will be ok to start from the beginning or
> end of the topic. So if you are not ok with that you should choose none.
>
> Dmitriy: Ok. Then it is spring-kafka that maintains this offset per
> partition state for you. it might also has that problem of leaving stale
> offsets lying around, After quickly looking through
> https://github.com/spring-projects/spring-kafka/blob/
> 1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/
> main/java/org/springframework/kafka/listener/
> KafkaMessageListenerContainer.java
> it looks possible since offsets map is not cleared upon partition
> revocation, but that is just a hypothesis. I have no experience with
> spring-kafka. However since you say you consumers were always active I find
> this theory worth investigating.
>
>
> 2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
> vincent.dautremont@olamobile.com.invalid>:
>
> > is there a way to read messages on a topic partition from a specific node
> > we that we choose (and not by the topic partition leader) ?
> > I would like to read myself that each of the __consumer_offsets partition
> > replicas have the same consumer group offset written in it in it.
> >
> > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> > dvsekhvalnov@gmail.com>
> > wrote:
> >
> > > Stas:
> > >
> > > we rely on spring-kafka, it  commits offsets "manually" for us after
> > event
> > > handler completed. So it's kind of automatic once there is constant
> > stream
> > > of events (no idle time, which is true for us). Though it's not what
> pure
> > > kafka-client calls "automatic" (flush commits at fixed intervals).
> > >
> > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <sc...@gmail.com>
> wrote:
> > >
> > > > You don't have autocmmit enables that means you commit offsets
> > yourself -
> > > > correct? If you store them per partition somewhere and fail to clean
> it
> > > up
> > > > upon rebalance next time the consumer gets this partition assigned
> > during
> > > > next rebalance it can commit old stale offset- can this be the case?
> > > >
> > > >
> > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > > > dvsekhvalnov@gmail.com
> > > > >:
> > > >
> > > > > Reprocessing same events again - is fine for us (idempotent). While
> > > > loosing
> > > > > data is more critical.
> > > > >
> > > > > What are reasons of such behaviour? Consumers are never idle,
> always
> > > > > commiting, probably something wrong with broker setup then?
> > > > >
> > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > >
> > > > > > Stas:
> > > > > > bq.  using anything but none is not really an option
> > > > > >
> > > > > > If you have time, can you explain a bit more ?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <schizhov@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > If you set auto.offset.reset to none next time it happens you
> > will
> > > be
> > > > > in
> > > > > > > much better position to find out what happens. Also in general
> > with
> > > > > > current
> > > > > > > semantics of offset reset policy IMO using anything but none is
> > not
> > > > > > really
> > > > > > > an option unless it is ok for consumer to loose some data
> > (latest)
> > > or
> > > > > > > reprocess it second time (earliest).
> > > > > > >
> > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
> > > > > > >
> > > > > > > > Should Kafka log warning if log.retention.hours is lower than
> > > > number
> > > > > of
> > > > > > > > hours specified by offsets.retention.minutes ?
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > > > manikumar.reddy@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > normally, log.retention.hours (168hrs)  should be higher
> than
> > > > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Ted,
> > > > > > > > > >
> > > > > > > > > > Broker: v0.11.0.0
> > > > > > > > > >
> > > > > > > > > > Consumer:
> > > > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > > > auto.offset.reset = earliest
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> > yuzhihong@gmail.com>
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > > > >
> > > > > > > > > > > Which release are you using ?
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > we several time faced situation where consumer-group
> > > > started
> > > > > to
> > > > > > > > > > > re-consume
> > > > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node
> zookeeper
> > > > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > > > 3. log.retention.hours=168 and
> > > > > offsets.retention.minutes=20160
> > > > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > > > 5. doing disaster testing by randomly shutting down 1
> > of
> > > 3
> > > > > > broker
> > > > > > > > > nodes
> > > > > > > > > > > > (then provision new broker back)
> > > > > > > > > > > >
> > > > > > > > > > > > Several times after bouncing broker we faced
> situation
> > > > where
> > > > > > > > consumer
> > > > > > > > > > > group
> > > > > > > > > > > > started to re-consume old events.
> > > > > > > > > > > >
> > > > > > > > > > > > consumer group:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > > > 2. tried graceful group shutdown, kill -9 and
> > terminating
> > > > AWS
> > > > > > > nodes
> > > > > > > > > > > > 3. never experienced re-consumption for given cases.
> > > > > > > > > > > >
> > > > > > > > > > > > What can cause that old events re-consumption? Is it
> > > > related
> > > > > to
> > > > > > > > > > bouncing
> > > > > > > > > > > > one of brokers? What to search in a logs? Any broker
> > > > settings
> > > > > > to
> > > > > > > > try?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks in advance.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > --
> > The information transmitted is intended only for the person or entity to
> > which it is addressed and may contain confidential and/or privileged
> > material. Any review, retransmission, dissemination or other use of, or
> > taking of any action in reliance upon, this information by persons or
> > entities other than the intended recipient is prohibited. If you received
> > this in error, please contact the sender and delete the material from any
> > computer.
> >
>

Re: kafka broker loosing offsets?

Posted by Stas Chizhov <sc...@gmail.com>.
Ted: when choosing earliest/latest you are saying: if it happens that there
is no "valid" offset committed for a consumer (for whatever reason:
bug/misconfiguration/bad luck), it is OK to start from the beginning or
the end of the topic. So if you are not OK with that, you should choose none.

Dmitriy: OK. Then it is spring-kafka that maintains this offset-per-partition
state for you. It might also have that problem of leaving stale offsets
lying around. After quickly looking through
https://github.com/spring-projects/spring-kafka/blob/1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/main/java/org/springframework/kafka/listener/KafkaMessageListenerContainer.java
it looks possible, since the offsets map is not cleared upon partition
revocation, but that is just a hypothesis. I have no experience with
spring-kafka. However, since you say your consumers were always active, I find
this theory worth investigating.
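
To illustrate the class of bug I mean (this is not spring-kafka's actual code, just
a minimal sketch with the plain consumer API): if per-partition offsets are kept in
a map for later commit, that map has to be pruned when partitions are revoked,
otherwise a later commit can push an old, stale offset for a partition the consumer
got back after a rebalance.

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class PendingOffsetTracker implements ConsumerRebalanceListener {
    // Offsets that were processed but not yet committed, tracked per partition.
    private final Map<TopicPartition, OffsetAndMetadata> pending = new ConcurrentHashMap<>();

    public void track(TopicPartition tp, long nextOffsetToCommit) {
        pending.put(tp, new OffsetAndMetadata(nextOffsetToCommit));
    }

    public Map<TopicPartition, OffsetAndMetadata> snapshotForCommit() {
        return new HashMap<>(pending);
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // The important part: drop state for partitions we no longer own
        // (a real implementation would typically commit them here first).
        pending.keySet().removeAll(partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Nothing to do for this sketch.
    }
}

An instance of this would be passed to consumer.subscribe(topics, listener), and
commits would only ever use the snapshot taken from it.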


2017-10-06 18:20 GMT+02:00 Vincent Dautremont <
vincent.dautremont@olamobile.com.invalid>:

> is there a way to read messages on a topic partition from a specific node
> we that we choose (and not by the topic partition leader) ?
> I would like to read myself that each of the __consumer_offsets partition
> replicas have the same consumer group offset written in it in it.
>
> On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com>
> wrote:
>
> > Stas:
> >
> > we rely on spring-kafka, it  commits offsets "manually" for us after
> event
> > handler completed. So it's kind of automatic once there is constant
> stream
> > of events (no idle time, which is true for us). Though it's not what pure
> > kafka-client calls "automatic" (flush commits at fixed intervals).
> >
> > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <sc...@gmail.com> wrote:
> >
> > > You don't have autocmmit enables that means you commit offsets
> yourself -
> > > correct? If you store them per partition somewhere and fail to clean it
> > up
> > > upon rebalance next time the consumer gets this partition assigned
> during
> > > next rebalance it can commit old stale offset- can this be the case?
> > >
> > >
> > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > > dvsekhvalnov@gmail.com
> > > >:
> > >
> > > > Reprocessing same events again - is fine for us (idempotent). While
> > > loosing
> > > > data is more critical.
> > > >
> > > > What are reasons of such behaviour? Consumers are never idle, always
> > > > commiting, probably something wrong with broker setup then?
> > > >
> > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > > Stas:
> > > > > bq.  using anything but none is not really an option
> > > > >
> > > > > If you have time, can you explain a bit more ?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <sc...@gmail.com>
> > > wrote:
> > > > >
> > > > > > If you set auto.offset.reset to none next time it happens you
> will
> > be
> > > > in
> > > > > > much better position to find out what happens. Also in general
> with
> > > > > current
> > > > > > semantics of offset reset policy IMO using anything but none is
> not
> > > > > really
> > > > > > an option unless it is ok for consumer to loose some data
> (latest)
> > or
> > > > > > reprocess it second time (earliest).
> > > > > >
> > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
> > > > > >
> > > > > > > Should Kafka log warning if log.retention.hours is lower than
> > > number
> > > > of
> > > > > > > hours specified by offsets.retention.minutes ?
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > > manikumar.reddy@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > normally, log.retention.hours (168hrs)  should be higher than
> > > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Ted,
> > > > > > > > >
> > > > > > > > > Broker: v0.11.0.0
> > > > > > > > >
> > > > > > > > > Consumer:
> > > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > > auto.offset.reset = earliest
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> yuzhihong@gmail.com>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > > >
> > > > > > > > > > Which release are you using ?
> > > > > > > > > >
> > > > > > > > > > Cheers
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > we several time faced situation where consumer-group
> > > started
> > > > to
> > > > > > > > > > re-consume
> > > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > > >
> > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > > 3. log.retention.hours=168 and
> > > > offsets.retention.minutes=20160
> > > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > > 5. doing disaster testing by randomly shutting down 1
> of
> > 3
> > > > > broker
> > > > > > > > nodes
> > > > > > > > > > > (then provision new broker back)
> > > > > > > > > > >
> > > > > > > > > > > Several times after bouncing broker we faced situation
> > > where
> > > > > > > consumer
> > > > > > > > > > group
> > > > > > > > > > > started to re-consume old events.
> > > > > > > > > > >
> > > > > > > > > > > consumer group:
> > > > > > > > > > >
> > > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > > 2. tried graceful group shutdown, kill -9 and
> terminating
> > > AWS
> > > > > > nodes
> > > > > > > > > > > 3. never experienced re-consumption for given cases.
> > > > > > > > > > >
> > > > > > > > > > > What can cause that old events re-consumption? Is it
> > > related
> > > > to
> > > > > > > > > bouncing
> > > > > > > > > > > one of brokers? What to search in a logs? Any broker
> > > settings
> > > > > to
> > > > > > > try?
> > > > > > > > > > >
> > > > > > > > > > > Thanks in advance.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
> --
> The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from any
> computer.
>

Re: kafka broker loosing offsets?

Posted by Ted Yu <yu...@gmail.com>.
A brief search brought me to a related discussion on this JIRA:

https://issues.apache.org/jira/browse/KAFKA-3806?focusedCommentId=15906349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15906349

FYI

On Fri, Oct 6, 2017 at 10:37 AM, Manikumar <ma...@gmail.com>
wrote:

> @Ted  Yes, I think we should add log warning message.
>
> On Fri, Oct 6, 2017 at 9:50 PM, Vincent Dautremont <
> vincent.dautremont@olamobile.com.invalid> wrote:
>
> > is there a way to read messages on a topic partition from a specific node
> > we that we choose (and not by the topic partition leader) ?
> > I would like to read myself that each of the __consumer_offsets partition
> > replicas have the same consumer group offset written in it in it.
> >
> > On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> > dvsekhvalnov@gmail.com>
> > wrote:
> >
> > > Stas:
> > >
> > > we rely on spring-kafka, it  commits offsets "manually" for us after
> > event
> > > handler completed. So it's kind of automatic once there is constant
> > stream
> > > of events (no idle time, which is true for us). Though it's not what
> pure
> > > kafka-client calls "automatic" (flush commits at fixed intervals).
> > >
> > > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <sc...@gmail.com>
> wrote:
> > >
> > > > You don't have autocmmit enables that means you commit offsets
> > yourself -
> > > > correct? If you store them per partition somewhere and fail to clean
> it
> > > up
> > > > upon rebalance next time the consumer gets this partition assigned
> > during
> > > > next rebalance it can commit old stale offset- can this be the case?
> > > >
> > > >
> > > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > > > dvsekhvalnov@gmail.com
> > > > >:
> > > >
> > > > > Reprocessing same events again - is fine for us (idempotent). While
> > > > loosing
> > > > > data is more critical.
> > > > >
> > > > > What are reasons of such behaviour? Consumers are never idle,
> always
> > > > > commiting, probably something wrong with broker setup then?
> > > > >
> > > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > >
> > > > > > Stas:
> > > > > > bq.  using anything but none is not really an option
> > > > > >
> > > > > > If you have time, can you explain a bit more ?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <schizhov@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > If you set auto.offset.reset to none next time it happens you
> > will
> > > be
> > > > > in
> > > > > > > much better position to find out what happens. Also in general
> > with
> > > > > > current
> > > > > > > semantics of offset reset policy IMO using anything but none is
> > not
> > > > > > really
> > > > > > > an option unless it is ok for consumer to loose some data
> > (latest)
> > > or
> > > > > > > reprocess it second time (earliest).
> > > > > > >
> > > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
> > > > > > >
> > > > > > > > Should Kafka log warning if log.retention.hours is lower than
> > > > number
> > > > > of
> > > > > > > > hours specified by offsets.retention.minutes ?
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > > > manikumar.reddy@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > normally, log.retention.hours (168hrs)  should be higher
> than
> > > > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Ted,
> > > > > > > > > >
> > > > > > > > > > Broker: v0.11.0.0
> > > > > > > > > >
> > > > > > > > > > Consumer:
> > > > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > > > auto.offset.reset = earliest
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> > yuzhihong@gmail.com>
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > > > >
> > > > > > > > > > > Which release are you using ?
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > we several time faced situation where consumer-group
> > > > started
> > > > > to
> > > > > > > > > > > re-consume
> > > > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node
> zookeeper
> > > > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > > > 3. log.retention.hours=168 and
> > > > > offsets.retention.minutes=20160
> > > > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > > > 5. doing disaster testing by randomly shutting down 1
> > of
> > > 3
> > > > > > broker
> > > > > > > > > nodes
> > > > > > > > > > > > (then provision new broker back)
> > > > > > > > > > > >
> > > > > > > > > > > > Several times after bouncing broker we faced
> situation
> > > > where
> > > > > > > > consumer
> > > > > > > > > > > group
> > > > > > > > > > > > started to re-consume old events.
> > > > > > > > > > > >
> > > > > > > > > > > > consumer group:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > > > 2. tried graceful group shutdown, kill -9 and
> > terminating
> > > > AWS
> > > > > > > nodes
> > > > > > > > > > > > 3. never experienced re-consumption for given cases.
> > > > > > > > > > > >
> > > > > > > > > > > > What can cause that old events re-consumption? Is it
> > > > related
> > > > > to
> > > > > > > > > > bouncing
> > > > > > > > > > > > one of brokers? What to search in a logs? Any broker
> > > > settings
> > > > > > to
> > > > > > > > try?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks in advance.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > --
> > The information transmitted is intended only for the person or entity to
> > which it is addressed and may contain confidential and/or privileged
> > material. Any review, retransmission, dissemination or other use of, or
> > taking of any action in reliance upon, this information by persons or
> > entities other than the intended recipient is prohibited. If you received
> > this in error, please contact the sender and delete the material from any
> > computer.
> >
>

Re: kafka broker loosing offsets?

Posted by Manikumar <ma...@gmail.com>.
@Ted  Yes, I think we should add a log warning message.
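
Something as simple as a check at broker startup would probably do. A hypothetical
sketch (not actual broker code - only the two config names are real, and I am
printing instead of going through the broker's logger):

public class RetentionSanityCheck {
    // Warn when data is retained for less time than the committed offsets,
    // which is the situation discussed in this thread.
    public static void warnIfSuspicious(long logRetentionHours, long offsetsRetentionMinutes) {
        long logRetentionMinutes = logRetentionHours * 60;
        if (logRetentionMinutes < offsetsRetentionMinutes) {
            System.err.printf(
                "WARN: log.retention.hours (%d h) is lower than offsets.retention.minutes (%d min ~ %d h)%n",
                logRetentionHours, offsetsRetentionMinutes, offsetsRetentionMinutes / 60);
        }
    }

    public static void main(String[] args) {
        warnIfSuspicious(168, 20160);   // the values from this thread -> prints the warning
    }
}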

On Fri, Oct 6, 2017 at 9:50 PM, Vincent Dautremont <
vincent.dautremont@olamobile.com.invalid> wrote:

> is there a way to read messages on a topic partition from a specific node
> we that we choose (and not by the topic partition leader) ?
> I would like to read myself that each of the __consumer_offsets partition
> replicas have the same consumer group offset written in it in it.
>
> On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com>
> wrote:
>
> > Stas:
> >
> > we rely on spring-kafka, it  commits offsets "manually" for us after
> event
> > handler completed. So it's kind of automatic once there is constant
> stream
> > of events (no idle time, which is true for us). Though it's not what pure
> > kafka-client calls "automatic" (flush commits at fixed intervals).
> >
> > On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <sc...@gmail.com> wrote:
> >
> > > You don't have autocmmit enables that means you commit offsets
> yourself -
> > > correct? If you store them per partition somewhere and fail to clean it
> > up
> > > upon rebalance next time the consumer gets this partition assigned
> during
> > > next rebalance it can commit old stale offset- can this be the case?
> > >
> > >
> > > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > > dvsekhvalnov@gmail.com
> > > >:
> > >
> > > > Reprocessing same events again - is fine for us (idempotent). While
> > > loosing
> > > > data is more critical.
> > > >
> > > > What are reasons of such behaviour? Consumers are never idle, always
> > > > commiting, probably something wrong with broker setup then?
> > > >
> > > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > > Stas:
> > > > > bq.  using anything but none is not really an option
> > > > >
> > > > > If you have time, can you explain a bit more ?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <sc...@gmail.com>
> > > wrote:
> > > > >
> > > > > > If you set auto.offset.reset to none next time it happens you
> will
> > be
> > > > in
> > > > > > much better position to find out what happens. Also in general
> with
> > > > > current
> > > > > > semantics of offset reset policy IMO using anything but none is
> not
> > > > > really
> > > > > > an option unless it is ok for consumer to loose some data
> (latest)
> > or
> > > > > > reprocess it second time (earliest).
> > > > > >
> > > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
> > > > > >
> > > > > > > Should Kafka log warning if log.retention.hours is lower than
> > > number
> > > > of
> > > > > > > hours specified by offsets.retention.minutes ?
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > > manikumar.reddy@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > normally, log.retention.hours (168hrs)  should be higher than
> > > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Ted,
> > > > > > > > >
> > > > > > > > > Broker: v0.11.0.0
> > > > > > > > >
> > > > > > > > > Consumer:
> > > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > > auto.offset.reset = earliest
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <
> yuzhihong@gmail.com>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > > >
> > > > > > > > > > Which release are you using ?
> > > > > > > > > >
> > > > > > > > > > Cheers
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > we several time faced situation where consumer-group
> > > started
> > > > to
> > > > > > > > > > re-consume
> > > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > > >
> > > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > > 3. log.retention.hours=168 and
> > > > offsets.retention.minutes=20160
> > > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > > 5. doing disaster testing by randomly shutting down 1
> of
> > 3
> > > > > broker
> > > > > > > > nodes
> > > > > > > > > > > (then provision new broker back)
> > > > > > > > > > >
> > > > > > > > > > > Several times after bouncing broker we faced situation
> > > where
> > > > > > > consumer
> > > > > > > > > > group
> > > > > > > > > > > started to re-consume old events.
> > > > > > > > > > >
> > > > > > > > > > > consumer group:
> > > > > > > > > > >
> > > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > > 2. tried graceful group shutdown, kill -9 and
> terminating
> > > AWS
> > > > > > nodes
> > > > > > > > > > > 3. never experienced re-consumption for given cases.
> > > > > > > > > > >
> > > > > > > > > > > What can cause that old events re-consumption? Is it
> > > related
> > > > to
> > > > > > > > > bouncing
> > > > > > > > > > > one of brokers? What to search in a logs? Any broker
> > > settings
> > > > > to
> > > > > > > try?
> > > > > > > > > > >
> > > > > > > > > > > Thanks in advance.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
> --
> The information transmitted is intended only for the person or entity to
> which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from any
> computer.
>

Re: kafka broker loosing offsets?

Posted by Vincent Dautremont <vi...@olamobile.com.INVALID>.
Is there a way to read the messages of a topic partition from a specific node
that we choose (and not from the topic partition leader)?
I would like to verify myself that each of the __consumer_offsets partition
replicas has the same consumer group offset written in it.
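
(As far as I can tell the consumer always fetches from the partition leader, so I
cannot simply point it at a replica of my choice.) What I can check from the client
side in the meantime is whether every replica of __consumer_offsets is still in the
ISR, since a replica missing from the ISR has not caught up. A sketch with the
AdminClient (bootstrap server is a placeholder):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class OffsetsTopicReplicaStatus {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                admin.describeTopics(Collections.singleton("__consumer_offsets")).all().get();
            for (TopicPartitionInfo p : topics.get("__consumer_offsets").partitions()) {
                System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                    p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}

To compare the actual on-disk contents of the replicas I guess I would have to
decode the __consumer_offsets segments from the log directories on each broker
(kafka.tools.DumpLogSegments, if I remember correctly).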

On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <dv...@gmail.com>
wrote:

> Stas:
>
> we rely on spring-kafka, it  commits offsets "manually" for us after event
> handler completed. So it's kind of automatic once there is constant stream
> of events (no idle time, which is true for us). Though it's not what pure
> kafka-client calls "automatic" (flush commits at fixed intervals).
>
> On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <sc...@gmail.com> wrote:
>
> > You don't have autocmmit enables that means you commit offsets yourself -
> > correct? If you store them per partition somewhere and fail to clean it
> up
> > upon rebalance next time the consumer gets this partition assigned during
> > next rebalance it can commit old stale offset- can this be the case?
> >
> >
> > fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> > dvsekhvalnov@gmail.com
> > >:
> >
> > > Reprocessing same events again - is fine for us (idempotent). While
> > loosing
> > > data is more critical.
> > >
> > > What are reasons of such behaviour? Consumers are never idle, always
> > > commiting, probably something wrong with broker setup then?
> > >
> > > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Stas:
> > > > bq.  using anything but none is not really an option
> > > >
> > > > If you have time, can you explain a bit more ?
> > > >
> > > > Thanks
> > > >
> > > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <sc...@gmail.com>
> > wrote:
> > > >
> > > > > If you set auto.offset.reset to none next time it happens you will
> be
> > > in
> > > > > much better position to find out what happens. Also in general with
> > > > current
> > > > > semantics of offset reset policy IMO using anything but none is not
> > > > really
> > > > > an option unless it is ok for consumer to loose some data (latest)
> or
> > > > > reprocess it second time (earliest).
> > > > >
> > > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
> > > > >
> > > > > > Should Kafka log warning if log.retention.hours is lower than
> > number
> > > of
> > > > > > hours specified by offsets.retention.minutes ?
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> > manikumar.reddy@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > normally, log.retention.hours (168hrs)  should be higher than
> > > > > > > offsets.retention.minutes (336 hrs)?
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Ted,
> > > > > > > >
> > > > > > > > Broker: v0.11.0.0
> > > > > > > >
> > > > > > > > Consumer:
> > > > > > > > kafka-clients v0.11.0.0
> > > > > > > > auto.offset.reset = earliest
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com>
> > > > wrote:
> > > > > > > >
> > > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > > >
> > > > > > > > > Which release are you using ?
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > we several time faced situation where consumer-group
> > started
> > > to
> > > > > > > > > re-consume
> > > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > > >
> > > > > > > > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > > 3. log.retention.hours=168 and
> > > offsets.retention.minutes=20160
> > > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > > 5. doing disaster testing by randomly shutting down 1 of
> 3
> > > > broker
> > > > > > > nodes
> > > > > > > > > > (then provision new broker back)
> > > > > > > > > >
> > > > > > > > > > Several times after bouncing broker we faced situation
> > where
> > > > > > consumer
> > > > > > > > > group
> > > > > > > > > > started to re-consume old events.
> > > > > > > > > >
> > > > > > > > > > consumer group:
> > > > > > > > > >
> > > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > > 2. tried graceful group shutdown, kill -9 and terminating
> > AWS
> > > > > nodes
> > > > > > > > > > 3. never experienced re-consumption for given cases.
> > > > > > > > > >
> > > > > > > > > > What can cause that old events re-consumption? Is it
> > related
> > > to
> > > > > > > > bouncing
> > > > > > > > > > one of brokers? What to search in a logs? Any broker
> > settings
> > > > to
> > > > > > try?
> > > > > > > > > >
> > > > > > > > > > Thanks in advance.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: kafka broker loosing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
Stas:

we rely on spring-kafka; it commits offsets "manually" for us after the event
handler has completed. So it's kind of automatic as long as there is a constant
stream of events (no idle time, which is true for us). Though it's not what the
pure kafka-client calls "automatic" (flushing commits at fixed intervals).
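
In plain kafka-clients terms the pattern is roughly this (just a sketch of the idea,
not spring-kafka's actual implementation; bootstrap server, group id and topic are
placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterHandlerLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder
        props.put("group.id", "my-consumer-group");       // placeholder
        props.put("enable.auto.commit", "false");         // we commit explicitly below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));   // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    handle(record);               // the application's event handler
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();        // commit only after the handler finished
                }
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        // application logic goes here
    }
}

So the offsets that get committed are always the ones the consumer just processed;
there is no fixed-interval flushing involved.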

On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <sc...@gmail.com> wrote:

> You don't have autocmmit enables that means you commit offsets yourself -
> correct? If you store them per partition somewhere and fail to clean it up
> upon rebalance next time the consumer gets this partition assigned during
> next rebalance it can commit old stale offset- can this be the case?
>
>
> fre 6 okt. 2017 kl. 17:59 skrev Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com
> >:
>
> > Reprocessing same events again - is fine for us (idempotent). While
> loosing
> > data is more critical.
> >
> > What are reasons of such behaviour? Consumers are never idle, always
> > commiting, probably something wrong with broker setup then?
> >
> > On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Stas:
> > > bq.  using anything but none is not really an option
> > >
> > > If you have time, can you explain a bit more ?
> > >
> > > Thanks
> > >
> > > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <sc...@gmail.com>
> wrote:
> > >
> > > > If you set auto.offset.reset to none next time it happens you will be
> > in
> > > > much better position to find out what happens. Also in general with
> > > current
> > > > semantics of offset reset policy IMO using anything but none is not
> > > really
> > > > an option unless it is ok for consumer to loose some data (latest) or
> > > > reprocess it second time (earliest).
> > > >
> > > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
> > > >
> > > > > Should Kafka log warning if log.retention.hours is lower than
> number
> > of
> > > > > hours specified by offsets.retention.minutes ?
> > > > >
> > > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <
> manikumar.reddy@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > normally, log.retention.hours (168hrs)  should be higher than
> > > > > > offsets.retention.minutes (336 hrs)?
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > > dvsekhvalnov@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ted,
> > > > > > >
> > > > > > > Broker: v0.11.0.0
> > > > > > >
> > > > > > > Consumer:
> > > > > > > kafka-clients v0.11.0.0
> > > > > > > auto.offset.reset = earliest
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > What's the value for auto.offset.reset  ?
> > > > > > > >
> > > > > > > > Which release are you using ?
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > we several time faced situation where consumer-group
> started
> > to
> > > > > > > > re-consume
> > > > > > > > > old events from beginning. Here is scenario:
> > > > > > > > >
> > > > > > > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > > > > > > 2. RF=3 for all topics
> > > > > > > > > 3. log.retention.hours=168 and
> > offsets.retention.minutes=20160
> > > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > > 5. doing disaster testing by randomly shutting down 1 of 3
> > > broker
> > > > > > nodes
> > > > > > > > > (then provision new broker back)
> > > > > > > > >
> > > > > > > > > Several times after bouncing broker we faced situation
> where
> > > > > consumer
> > > > > > > > group
> > > > > > > > > started to re-consume old events.
> > > > > > > > >
> > > > > > > > > consumer group:
> > > > > > > > >
> > > > > > > > > 1. enable.auto.commit = false
> > > > > > > > > 2. tried graceful group shutdown, kill -9 and terminating
> AWS
> > > > nodes
> > > > > > > > > 3. never experienced re-consumption for given cases.
> > > > > > > > >
> > > > > > > > > What can cause that old events re-consumption? Is it
> related
> > to
> > > > > > > bouncing
> > > > > > > > > one of brokers? What to search in a logs? Any broker
> settings
> > > to
> > > > > try?
> > > > > > > > >
> > > > > > > > > Thanks in advance.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: kafka broker losing offsets?

Posted by Stas Chizhov <sc...@gmail.com>.
You don't have auto-commit enabled, which means you commit offsets
yourself - correct? If you store them per partition somewhere and fail to
clean that up on rebalance, then the next time the consumer gets that
partition assigned it can commit an old, stale offset - could this be the
case?
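
To illustrate the pitfall being asked about (hypothetical code, not taken
from this thread): if a consumer keeps its own per-partition offset cache
and commits from it, the cache has to be dropped when partitions are
revoked, otherwise a stale entry can be committed after the partition has
moved away and come back. A minimal sketch with the plain Java consumer
(broker, group and topic names are placeholders):

import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class RebalanceSafeCommits {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");     // placeholder
        props.put("group.id", "example-consumer-group");    // placeholder
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // locally cached "next offset to commit" per partition
        final Map<TopicPartition, OffsetAndMetadata> pending = new HashMap<>();

        consumer.subscribe(Collections.singletonList("events"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                consumer.commitSync(pending);           // flush what we have ...
                pending.keySet().removeAll(partitions); // ... then drop the cache so a
                                                        // stale offset is never re-committed
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // nothing cached yet for newly assigned partitions
            }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                // ... process the record, then remember the next offset to commit
                pending.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
            }
            consumer.commitSync(pending);
        }
    }
}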


On Fri, Oct 6, 2017 at 5:59 PM, Dmitriy Vsekhvalnov <dvsekhvalnov@gmail.com> wrote:

> Reprocessing same events again - is fine for us (idempotent). While loosing
> data is more critical.
>
> What are reasons of such behaviour? Consumers are never idle, always
> commiting, probably something wrong with broker setup then?
>
> On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Stas:
> > bq.  using anything but none is not really an option
> >
> > If you have time, can you explain a bit more ?
> >
> > Thanks
> >
> > On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <sc...@gmail.com> wrote:
> >
> > > If you set auto.offset.reset to none next time it happens you will be
> in
> > > much better position to find out what happens. Also in general with
> > current
> > > semantics of offset reset policy IMO using anything but none is not
> > really
> > > an option unless it is ok for consumer to loose some data (latest) or
> > > reprocess it second time (earliest).
> > >
> > > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
> > >
> > > > Should Kafka log warning if log.retention.hours is lower than number
> of
> > > > hours specified by offsets.retention.minutes ?
> > > >
> > > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <manikumar.reddy@gmail.com
> >
> > > > wrote:
> > > >
> > > > > normally, log.retention.hours (168hrs)  should be higher than
> > > > > offsets.retention.minutes (336 hrs)?
> > > > >
> > > > >
> > > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > > dvsekhvalnov@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Ted,
> > > > > >
> > > > > > Broker: v0.11.0.0
> > > > > >
> > > > > > Consumer:
> > > > > > kafka-clients v0.11.0.0
> > > > > > auto.offset.reset = earliest
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > What's the value for auto.offset.reset  ?
> > > > > > >
> > > > > > > Which release are you using ?
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > > dvsekhvalnov@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > we several time faced situation where consumer-group started
> to
> > > > > > > re-consume
> > > > > > > > old events from beginning. Here is scenario:
> > > > > > > >
> > > > > > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > > > > > 2. RF=3 for all topics
> > > > > > > > 3. log.retention.hours=168 and
> offsets.retention.minutes=20160
> > > > > > > > 4. running sustainable load (pushing events)
> > > > > > > > 5. doing disaster testing by randomly shutting down 1 of 3
> > broker
> > > > > nodes
> > > > > > > > (then provision new broker back)
> > > > > > > >
> > > > > > > > Several times after bouncing broker we faced situation where
> > > > consumer
> > > > > > > group
> > > > > > > > started to re-consume old events.
> > > > > > > >
> > > > > > > > consumer group:
> > > > > > > >
> > > > > > > > 1. enable.auto.commit = false
> > > > > > > > 2. tried graceful group shutdown, kill -9 and terminating AWS
> > > nodes
> > > > > > > > 3. never experienced re-consumption for given cases.
> > > > > > > >
> > > > > > > > What can cause that old events re-consumption? Is it related
> to
> > > > > > bouncing
> > > > > > > > one of brokers? What to search in a logs? Any broker settings
> > to
> > > > try?
> > > > > > > >
> > > > > > > > Thanks in advance.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: kafka broker losing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
Reprocessing the same events again is fine for us (idempotent), while
losing data is more critical.

What are the reasons for such behaviour? The consumers are never idle and
are always committing, so is something wrong with the broker setup then?

On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yu...@gmail.com> wrote:

> Stas:
> bq.  using anything but none is not really an option
>
> If you have time, can you explain a bit more ?
>
> Thanks
>
> On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <sc...@gmail.com> wrote:
>
> > If you set auto.offset.reset to none next time it happens you will be in
> > much better position to find out what happens. Also in general with
> current
> > semantics of offset reset policy IMO using anything but none is not
> really
> > an option unless it is ok for consumer to loose some data (latest) or
> > reprocess it second time (earliest).
> >
> > fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
> >
> > > Should Kafka log warning if log.retention.hours is lower than number of
> > > hours specified by offsets.retention.minutes ?
> > >
> > > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <ma...@gmail.com>
> > > wrote:
> > >
> > > > normally, log.retention.hours (168hrs)  should be higher than
> > > > offsets.retention.minutes (336 hrs)?
> > > >
> > > >
> > > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > > dvsekhvalnov@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Ted,
> > > > >
> > > > > Broker: v0.11.0.0
> > > > >
> > > > > Consumer:
> > > > > kafka-clients v0.11.0.0
> > > > > auto.offset.reset = earliest
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > > > >
> > > > > > What's the value for auto.offset.reset  ?
> > > > > >
> > > > > > Which release are you using ?
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > > dvsekhvalnov@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > we several time faced situation where consumer-group started to
> > > > > > re-consume
> > > > > > > old events from beginning. Here is scenario:
> > > > > > >
> > > > > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > > > > 2. RF=3 for all topics
> > > > > > > 3. log.retention.hours=168 and offsets.retention.minutes=20160
> > > > > > > 4. running sustainable load (pushing events)
> > > > > > > 5. doing disaster testing by randomly shutting down 1 of 3
> broker
> > > > nodes
> > > > > > > (then provision new broker back)
> > > > > > >
> > > > > > > Several times after bouncing broker we faced situation where
> > > consumer
> > > > > > group
> > > > > > > started to re-consume old events.
> > > > > > >
> > > > > > > consumer group:
> > > > > > >
> > > > > > > 1. enable.auto.commit = false
> > > > > > > 2. tried graceful group shutdown, kill -9 and terminating AWS
> > nodes
> > > > > > > 3. never experienced re-consumption for given cases.
> > > > > > >
> > > > > > > What can cause that old events re-consumption? Is it related to
> > > > > bouncing
> > > > > > > one of brokers? What to search in a logs? Any broker settings
> to
> > > try?
> > > > > > >
> > > > > > > Thanks in advance.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: kafka broker losing offsets?

Posted by Ted Yu <yu...@gmail.com>.
Stas:
bq.  using anything but none is not really an option

If you have time, can you explain a bit more?

Thanks

On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <sc...@gmail.com> wrote:

> If you set auto.offset.reset to none next time it happens you will be in
> much better position to find out what happens. Also in general with current
> semantics of offset reset policy IMO using anything but none is not really
> an option unless it is ok for consumer to loose some data (latest) or
> reprocess it second time (earliest).
>
> fre 6 okt. 2017 kl. 17:44 skrev Ted Yu <yu...@gmail.com>:
>
> > Should Kafka log warning if log.retention.hours is lower than number of
> > hours specified by offsets.retention.minutes ?
> >
> > On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <ma...@gmail.com>
> > wrote:
> >
> > > normally, log.retention.hours (168hrs)  should be higher than
> > > offsets.retention.minutes (336 hrs)?
> > >
> > >
> > > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > > dvsekhvalnov@gmail.com>
> > > wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > Broker: v0.11.0.0
> > > >
> > > > Consumer:
> > > > kafka-clients v0.11.0.0
> > > > auto.offset.reset = earliest
> > > >
> > > >
> > > >
> > > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com> wrote:
> > > >
> > > > > What's the value for auto.offset.reset  ?
> > > > >
> > > > > Which release are you using ?
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > > dvsekhvalnov@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > we several time faced situation where consumer-group started to
> > > > > re-consume
> > > > > > old events from beginning. Here is scenario:
> > > > > >
> > > > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > > > 2. RF=3 for all topics
> > > > > > 3. log.retention.hours=168 and offsets.retention.minutes=20160
> > > > > > 4. running sustainable load (pushing events)
> > > > > > 5. doing disaster testing by randomly shutting down 1 of 3 broker
> > > nodes
> > > > > > (then provision new broker back)
> > > > > >
> > > > > > Several times after bouncing broker we faced situation where
> > consumer
> > > > > group
> > > > > > started to re-consume old events.
> > > > > >
> > > > > > consumer group:
> > > > > >
> > > > > > 1. enable.auto.commit = false
> > > > > > 2. tried graceful group shutdown, kill -9 and terminating AWS
> nodes
> > > > > > 3. never experienced re-consumption for given cases.
> > > > > >
> > > > > > What can cause that old events re-consumption? Is it related to
> > > > bouncing
> > > > > > one of brokers? What to search in a logs? Any broker settings to
> > try?
> > > > > >
> > > > > > Thanks in advance.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: kafka broker losing offsets?

Posted by Stas Chizhov <sc...@gmail.com>.
If you set auto.offset.reset to none, then the next time this happens you
will be in a much better position to find out what is going on. Also, in
general, given the current semantics of the offset reset policy, IMO using
anything but none is not really an option unless it is OK for the consumer
to lose some data (latest) or to reprocess it a second time (earliest).
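
A minimal sketch of that approach with the plain Java consumer
(illustrative only; broker, group and topic names are placeholders). With
auto.offset.reset=none, a missing committed offset surfaces as an
exception instead of a silent jump to earliest or latest, so the restart
position becomes an explicit, logged decision:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;

public class ExplicitOffsetReset {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");     // placeholder
        props.put("group.id", "example-consumer-group");    // placeholder
        props.put("enable.auto.commit", "false");
        props.put("auto.offset.reset", "none");              // fail loudly instead of resetting
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));  // placeholder topic
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    // ... process and commit as usual ...
                } catch (NoOffsetForPartitionException e) {
                    // Committed offsets are gone: alert on it, then decide explicitly
                    // where to restart. Seeking to the beginning of every assigned
                    // partition is just one possible choice (a real handler would
                    // likely seek only the affected partitions).
                    System.err.println("No committed offset found: " + e.getMessage());
                    consumer.seekToBeginning(consumer.assignment());
                }
            }
        }
    }
}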

On Fri, Oct 6, 2017 at 5:44 PM, Ted Yu <yu...@gmail.com> wrote:

> Should Kafka log warning if log.retention.hours is lower than number of
> hours specified by offsets.retention.minutes ?
>
> On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <ma...@gmail.com>
> wrote:
>
> > normally, log.retention.hours (168hrs)  should be higher than
> > offsets.retention.minutes (336 hrs)?
> >
> >
> > On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> > dvsekhvalnov@gmail.com>
> > wrote:
> >
> > > Hi Ted,
> > >
> > > Broker: v0.11.0.0
> > >
> > > Consumer:
> > > kafka-clients v0.11.0.0
> > > auto.offset.reset = earliest
> > >
> > >
> > >
> > > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > What's the value for auto.offset.reset  ?
> > > >
> > > > Which release are you using ?
> > > >
> > > > Cheers
> > > >
> > > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > > dvsekhvalnov@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > we several time faced situation where consumer-group started to
> > > > re-consume
> > > > > old events from beginning. Here is scenario:
> > > > >
> > > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > > 2. RF=3 for all topics
> > > > > 3. log.retention.hours=168 and offsets.retention.minutes=20160
> > > > > 4. running sustainable load (pushing events)
> > > > > 5. doing disaster testing by randomly shutting down 1 of 3 broker
> > nodes
> > > > > (then provision new broker back)
> > > > >
> > > > > Several times after bouncing broker we faced situation where
> consumer
> > > > group
> > > > > started to re-consume old events.
> > > > >
> > > > > consumer group:
> > > > >
> > > > > 1. enable.auto.commit = false
> > > > > 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> > > > > 3. never experienced re-consumption for given cases.
> > > > >
> > > > > What can cause that old events re-consumption? Is it related to
> > > bouncing
> > > > > one of brokers? What to search in a logs? Any broker settings to
> try?
> > > > >
> > > > > Thanks in advance.
> > > > >
> > > >
> > >
> >
>

Re: kafka broker losing offsets?

Posted by Ted Yu <yu...@gmail.com>.
Should Kafka log a warning if log.retention.hours is lower than the number
of hours specified by offsets.retention.minutes?

On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <ma...@gmail.com> wrote:

> normally, log.retention.hours (168hrs)  should be higher than
> offsets.retention.minutes (336 hrs)?
>
>
> On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com>
> wrote:
>
> > Hi Ted,
> >
> > Broker: v0.11.0.0
> >
> > Consumer:
> > kafka-clients v0.11.0.0
> > auto.offset.reset = earliest
> >
> >
> >
> > On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > What's the value for auto.offset.reset  ?
> > >
> > > Which release are you using ?
> > >
> > > Cheers
> > >
> > > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > > dvsekhvalnov@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > we several time faced situation where consumer-group started to
> > > re-consume
> > > > old events from beginning. Here is scenario:
> > > >
> > > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > > 2. RF=3 for all topics
> > > > 3. log.retention.hours=168 and offsets.retention.minutes=20160
> > > > 4. running sustainable load (pushing events)
> > > > 5. doing disaster testing by randomly shutting down 1 of 3 broker
> nodes
> > > > (then provision new broker back)
> > > >
> > > > Several times after bouncing broker we faced situation where consumer
> > > group
> > > > started to re-consume old events.
> > > >
> > > > consumer group:
> > > >
> > > > 1. enable.auto.commit = false
> > > > 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> > > > 3. never experienced re-consumption for given cases.
> > > >
> > > > What can cause that old events re-consumption? Is it related to
> > bouncing
> > > > one of brokers? What to search in a logs? Any broker settings to try?
> > > >
> > > > Thanks in advance.
> > > >
> > >
> >
>

Re: kafka broker losing offsets?

Posted by Manikumar <ma...@gmail.com>.
Normally, shouldn't log.retention.hours (168 hrs) be higher than
offsets.retention.minutes (here 20160 min = 336 hrs)?
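
For reference, the two settings from the original report, with the unit
conversion spelled out (a server.properties sketch of the reported values,
not a recommendation):

# broker configuration as reported in this thread
log.retention.hours=168            # 168 h = 7 days of log segments
offsets.retention.minutes=20160    # 20160 min = 336 h = 14 days of committed offsets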


On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <dv...@gmail.com>
wrote:

> Hi Ted,
>
> Broker: v0.11.0.0
>
> Consumer:
> kafka-clients v0.11.0.0
> auto.offset.reset = earliest
>
>
>
> On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > What's the value for auto.offset.reset  ?
> >
> > Which release are you using ?
> >
> > Cheers
> >
> > On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> > dvsekhvalnov@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > we several time faced situation where consumer-group started to
> > re-consume
> > > old events from beginning. Here is scenario:
> > >
> > > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > > 2. RF=3 for all topics
> > > 3. log.retention.hours=168 and offsets.retention.minutes=20160
> > > 4. running sustainable load (pushing events)
> > > 5. doing disaster testing by randomly shutting down 1 of 3 broker nodes
> > > (then provision new broker back)
> > >
> > > Several times after bouncing broker we faced situation where consumer
> > group
> > > started to re-consume old events.
> > >
> > > consumer group:
> > >
> > > 1. enable.auto.commit = false
> > > 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> > > 3. never experienced re-consumption for given cases.
> > >
> > > What can cause that old events re-consumption? Is it related to
> bouncing
> > > one of brokers? What to search in a logs? Any broker settings to try?
> > >
> > > Thanks in advance.
> > >
> >
>

Re: kafka broker losing offsets?

Posted by Dmitriy Vsekhvalnov <dv...@gmail.com>.
Hi Ted,

Broker: v0.11.0.0

Consumer:
kafka-clients v0.11.0.0
auto.offset.reset = earliest



On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yu...@gmail.com> wrote:

> What's the value for auto.offset.reset  ?
>
> Which release are you using ?
>
> Cheers
>
> On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <
> dvsekhvalnov@gmail.com>
> wrote:
>
> > Hi all,
> >
> > we several time faced situation where consumer-group started to
> re-consume
> > old events from beginning. Here is scenario:
> >
> > 1. x3 broker kafka cluster on top of x3 node zookeeper
> > 2. RF=3 for all topics
> > 3. log.retention.hours=168 and offsets.retention.minutes=20160
> > 4. running sustainable load (pushing events)
> > 5. doing disaster testing by randomly shutting down 1 of 3 broker nodes
> > (then provision new broker back)
> >
> > Several times after bouncing broker we faced situation where consumer
> group
> > started to re-consume old events.
> >
> > consumer group:
> >
> > 1. enable.auto.commit = false
> > 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> > 3. never experienced re-consumption for given cases.
> >
> > What can cause that old events re-consumption? Is it related to bouncing
> > one of brokers? What to search in a logs? Any broker settings to try?
> >
> > Thanks in advance.
> >
>

Re: kafka broker losing offsets?

Posted by Ted Yu <yu...@gmail.com>.
What's the value for auto.offset.reset?

Which release are you using?

Cheers

On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <dv...@gmail.com>
wrote:

> Hi all,
>
> we several time faced situation where consumer-group started to re-consume
> old events from beginning. Here is scenario:
>
> 1. x3 broker kafka cluster on top of x3 node zookeeper
> 2. RF=3 for all topics
> 3. log.retention.hours=168 and offsets.retention.minutes=20160
> 4. running sustainable load (pushing events)
> 5. doing disaster testing by randomly shutting down 1 of 3 broker nodes
> (then provision new broker back)
>
> Several times after bouncing broker we faced situation where consumer group
> started to re-consume old events.
>
> consumer group:
>
> 1. enable.auto.commit = false
> 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> 3. never experienced re-consumption for given cases.
>
> What can cause that old events re-consumption? Is it related to bouncing
> one of brokers? What to search in a logs? Any broker settings to try?
>
> Thanks in advance.
>