Posted to users@kafka.apache.org by Yifan Ying <na...@gmail.com> on 2016/04/02 04:46:22 UTC

Kafka constant shrinking and expanding after deleting a topic

Hi All,

We deleted a deprecated topic on a Kafka cluster (0.8) and started observing
constant 'Expanding ISR for partition' and 'Shrinking ISR for partition'
log messages for other topics. As a result, we saw a huge number of
under-replicated partitions and very high request latency from Kafka, and
the cluster doesn't seem able to recover by itself.

Does anyone know what caused this issue and how to resolve it?
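[Editor's note: for context, in 0.8.x the leader drops a follower from the ISR when it falls more than `replica.lag.max.messages` messages behind or has not fetched within `replica.lag.time.max.ms` (defaults 4000 and 10000), and re-adds it once it catches up; these transitions are exactly what the 'Shrinking'/'Expanding' log lines report. A simplified sketch of that membership rule, not the actual broker code:]

```python
# Simplified 0.8-style ISR membership check (illustrative only,
# not the real broker implementation).
REPLICA_LAG_MAX_MESSAGES = 4000   # 0.8.x default
REPLICA_LAG_TIME_MAX_MS = 10000   # 0.8.x default

def in_sync(leader_end_offset, follower_offset, last_fetch_ms, now_ms,
            max_messages=REPLICA_LAG_MAX_MESSAGES,
            max_time_ms=REPLICA_LAG_TIME_MAX_MS):
    """A follower stays in the ISR only if it is close enough in
    messages AND has fetched recently enough."""
    message_lag = leader_end_offset - follower_offset
    time_lag_ms = now_ms - last_fetch_ms
    return message_lag <= max_messages and time_lag_ms <= max_time_ms

# A follower 5000 messages behind is evicted ("Shrinking ISR"):
assert not in_sync(105000, 100000, last_fetch_ms=0, now_ms=100)
# Once it catches back up it is re-added ("Expanding ISR"):
assert in_sync(105000, 104900, last_fetch_ms=0, now_ms=100)
```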

Re: Kafka constant shrinking and expanding after deleting a topic

Posted by Guozhang Wang <wa...@gmail.com>.
Alexis,

Hmm, yours seems to be a bug in the Kafka brokers, since your message relates
to a topic that was deleted months ago, indicating that the topic was not
deleted cleanly. Could you file a JIRA with server logs for further
investigation?

Guozhang


On Tue, Apr 5, 2016 at 10:02 PM, Alexis Midon <
alexis.midon@airbnb.com.invalid> wrote:

> I ran into the same issue today. In a production cluster, I noticed the
> "Shrinking ISR for partition" log messages for a topic deleted 2 months
> ago.
> Our staging cluster shows the same messages for all the topics deleted in
> that cluster.
> Both 0.8.2
>
> Yifan, Guozhang, did you find a way to get rid of them?
>
> thanks in advance,
> alexis
>
>
> On Tue, Apr 5, 2016 at 4:16 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > It is possible; there are some discussions of a similar issue in this KIP:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-53+-+Add+custom+policies+for+reconnect+attempts+to+NetworkdClient
> >
> > and in this mailing thread:
> >
> > https://www.mail-archive.com/dev@kafka.apache.org/msg46868.html
> >
> > Guozhang
> >
> > On Tue, Apr 5, 2016 at 2:34 PM, Yifan Ying <na...@gmail.com> wrote:
> >
> > > Some updates:
> > >
> > > Yesterday, right after a release (producers and consumers reconnected to
> > > Kafka/Zookeeper, but no code change in our producers and consumers), all
> > > under-replication issues resolved automatically, and there was no more
> > > high latency in either Kafka or Zookeeper. But right after today's
> > > release (producers and consumers reconnected again), the
> > > under-replication and high-latency issues happened again. So could the
> > > all-at-once reconnecting from producers and consumers be causing the
> > > problem? And all of this has only happened since I deleted a deprecated
> > > topic in production.
> > >
> > > Yifan
> > >
> > > On Tue, Apr 5, 2016 at 9:04 AM, Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > >> These configs mainly depend on your publish throughput, since the
> > >> replication throughput is upper-bounded by the publish throughput. If
> > >> the publish throughput is not high, then setting lower threshold values
> > >> in these two configs will cause churn in shrinking / expanding ISRs.
> > >>
> > >> Guozhang
> > >>
> > >> On Mon, Apr 4, 2016 at 11:55 PM, Yifan Ying <na...@gmail.com>
> wrote:
> > >>
> > >>> Thanks for replying, Guozhang. We did increase both settings:
> > >>>
> > >>> replica.lag.max.messages=20000
> > >>>
> > >>> replica.lag.time.max.ms=20000
> > >>>
> > >>> But not sure if these are good enough. And yes, that's a good
> > >>> suggestion to monitor ZK performance.
> > >>>
> > >>> Thanks.
> > >>>
> > >>> On Mon, Apr 4, 2016 at 8:58 PM, Guozhang Wang <wa...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hmm, it seems like your broker configs "replica.lag.max.messages" and
> > >>>> "replica.lag.time.max.ms" are misconfigured for your replication
> > >>>> traffic, and the deletion of the topic actually pushed it below the
> > >>>> threshold. What are the config values for these two? And could you
> > >>>> try increasing these configs and see if that helps?
> > >>>>
> > >>>> In 0.8.2.1, Kafka-consumer-offset-checker.sh accesses ZK to query the
> > >>>> consumer offsets one by one, and hence if your ZK read latency is
> > >>>> high it can take a long time. You may want to monitor your ZK
> > >>>> cluster's performance to check its read / write latencies.
> > >>>>
> > >>>> Guozhang
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 4, 2016 at 10:59 AM, Yifan Ying <na...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> Hi Guozhang,
> > >>>>>
> > >>>>> It's 0.8.2.1. So it should be fixed? We also tried to start from
> > >>>>> scratch by wiping out the data directory on both Kafka and
> > >>>>> Zookeeper. It's odd that the constant shrinking and expanding, and
> > >>>>> the high request latency, happened even after a fresh restart. The
> > >>>>> brokers are using the same config as before the topic deletion.
> > >>>>>
> > >>>>> Another observation is that using Kafka-consumer-offset-checker.sh
> > >>>>> is extremely slow. Any suggestion would be appreciated! Thanks.
> > >>>>>
> > >>>>> On Sun, Apr 3, 2016 at 2:29 PM, Guozhang Wang <wa...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Yifan,
> > >>>>>>
> > >>>>>> Are you on 0.8.0 or 0.8.1/2? There are some issues with zkVersion
> > >>>>>> checking
> > >>>>>> in 0.8.0 that are fixed in later minor releases of 0.8.
> > >>>>>>
> > >>>>>> Guozhang
> > >>>>>>



-- 
-- Guozhang

Re: Kafka constant shrinking and expanding after deleting a topic

Posted by Yifan Ying <na...@gmail.com>.
Guozhang, thanks for these links.

Hi Alexis, as Guozhang said, yours seems different from our case. We
deleted one topic, but it caused shrinking/expanding for other topics.

Yifan



-- 
Yifan

Re: Kafka constant shrinking and expanding after deleting a topic

Posted by Alexis Midon <al...@airbnb.com.INVALID>.
I ran into the same issue today. In a production cluster, I noticed the
"Shrinking ISR for partition" log messages for a topic deleted 2 months
ago.
Our staging cluster shows the same messages for all the topics deleted in
that cluster.
Both clusters are on 0.8.2.

Yifan, Guozhang, did you find a way to get rid of them?

thanks in advance,
alexis



Re: Kafka constant shrinking and expanding after deleting a topic

Posted by Guozhang Wang <wa...@gmail.com>.
It is possible; there are some discussions of a similar issue in this KIP:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-53+-+Add+custom+policies+for+reconnect+attempts+to+NetworkdClient

and in this mailing thread:

https://www.mail-archive.com/dev@kafka.apache.org/msg46868.html



Guozhang
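[Editor's note: KIP-53, linked above, discusses pluggable reconnect backoff policies for the client NetworkClient, motivated by exactly this kind of all-at-once reconnect storm. A minimal sketch of the usual mitigation, exponential backoff with full jitter; the function name and parameters are illustrative, not Kafka's API:]

```python
import random

def backoff_ms(attempt, base_ms=50, cap_ms=1000):
    """Exponential backoff with full jitter: each client waits a
    random delay in [0, min(cap, base * 2^attempt)], spreading
    reconnects out in time instead of letting them stampede."""
    exp = min(cap_ms, base_ms * (2 ** attempt))
    return random.uniform(0, exp)

# Delays grow (up to the cap) but stay randomized per client:
delays = [backoff_ms(a) for a in range(6)]
assert all(0 <= d <= 1000 for d in delays)
```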



-- 
-- Guozhang

Re: Kafka constant shrinking and expanding after deleting a topic

Posted by Guozhang Wang <wa...@gmail.com>.
These configs mainly depend on your publish throughput, since the
replication throughput is upper-bounded by the publish throughput. If the
publish throughput is not high, then setting lower threshold values in
these two configs will cause churn in shrinking / expanding ISRs.

Guozhang
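[Editor's note: to see why a threshold close to the normal publish burst size causes churn: if producers routinely publish bursts larger than replica.lag.max.messages, the follower's momentary lag crosses the threshold on every burst and the ISR shrinks, then expands once the follower catches up. A toy model with illustrative numbers only:]

```python
def isr_flips(burst_sizes, catch_up_per_tick, max_lag):
    """Count shrink/expand transitions as a follower alternately
    falls behind on a publish burst and then catches back up."""
    lag, in_isr, flips = 0, True, 0
    for burst in burst_sizes:
        lag += burst                           # producers publish a burst
        if (lag <= max_lag) != in_isr:         # lag crossed threshold
            in_isr, flips = not in_isr, flips + 1
        lag = max(0, lag - catch_up_per_tick)  # follower replicates
        if (lag <= max_lag) != in_isr:         # follower caught up
            in_isr, flips = not in_isr, flips + 1
    return flips

bursts = [5000, 5000, 5000]   # bursty publish pattern
# Low threshold: the ISR flips twice per burst/catch-up cycle.
assert isr_flips(bursts, catch_up_per_tick=5000, max_lag=1000) == 6
# A threshold above the burst size: no churn at all.
assert isr_flips(bursts, catch_up_per_tick=5000, max_lag=20000) == 0
```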



-- 
-- Guozhang

Re: Kafka constant shrinking and expanding after deleting a topic

Posted by Guozhang Wang <wa...@gmail.com>.
Yifan,

Are you on 0.8.0 or 0.8.1/2? There are some issues with zkVersion checking
in 0.8.0 that are fixed in later minor releases of 0.8.

Guozhang
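[Editor's note: for background on the zkVersion issue mentioned above: brokers update ISR state in ZooKeeper with a conditional write that only succeeds if the caller's expected version matches the node's current version, and in 0.8.0 a stale cached zkVersion could make every such write fail and retry. A toy model of that versioned compare-and-set; this is not ZooKeeper's actual API:]

```python
class ZNode:
    """Toy versioned node: set() succeeds only if the caller's
    expected version matches, mimicking a conditional setData."""
    def __init__(self, data):
        self.data, self.version = data, 0

    def set(self, data, expected_version):
        if expected_version != self.version:
            return False          # stale view: write is rejected
        self.data = data
        self.version += 1
        return True

node = ZNode(b"isr=[1,2,3]")
assert node.set(b"isr=[1,2]", expected_version=0)   # fresh write succeeds
# A broker holding the old version keeps failing until it re-reads:
assert not node.set(b"isr=[1,2,3]", expected_version=0)
assert node.set(b"isr=[1,2,3]", expected_version=node.version)
```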




-- 
-- Guozhang