You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by Steven Taschuk <st...@stripe.com.INVALID> on 2019/05/02 21:39:59 UTC

ISR briefly shrinks then expands

Our kafka broker logs show episodes where several partitions have
"Shrinking ISR ..." messages followed, usually less than 2 seconds
later, by corresponding "Expanding ISR ..." messages that restore the
original set of brokers for all the partitions.  Does anyone have any
suggestions on how to investigate this?

On our main cluster, consisting of 12 brokers running kafka 2.0.1,
with 2500-3000 partitions in 300-400 topics (each partition having 3
replicas), we see about 10 such episodes a day, each involving
typically 5-8 partitions.  The set of partitions varies but often
repeats or nearly repeats; to illustrate that, here's the partitions
affected in one broker's episodes over the past week:
    2019-04-27T17:51
status-4,dining-11,tax-0,education-0,government-7,locker-19,credit-11,law-15
    2019-04-27T18:01    family-14,health-13,golf-2,phone-14,news-15,peace-13
    2019-04-27T22:59
stock-13,income-2,insurance-0,college-0,district-1,breast-8,back-41
    2019-04-28T02:35    mg-12,executive-18,nursing-12
    2019-04-28T02:51    mg-12,executive-18,nursing-12
    2019-04-28T08:34
health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
    2019-04-28T12:15
health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
    2019-04-28T19:03
health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
    2019-04-28T20:16
health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
    2019-04-29T18:16
climate-11,lemon-12,faculty-22,side-4,music-12,police-0,room-0,press-8,parking-3,subject-10,blood-3
    2019-04-30T22:54    living-10,death-15,drug-3,talk-12,cell-12
    2019-05-01T16:52
oil-4,community-5,ice-7,public-2,substance-0,grocery-9,carbon-27,g-12
    2019-05-01T17:01
child-16,community-5,ice-7,public-2,grocery-9,carbon-27,task-3,g-12
    2019-05-01T17:36    community-5,ice-7,public-2,grocery-9,carbon-27,g-12
    2019-05-01T22:14
 school-9,interest-2,kitchen-0,hotel-3,carbon-3,heart-14
(I've replaced the actual topic names with common words.)

We see similar behaviour on another cluster, consisting of 7 brokers
running kafka 2.2.0.

We haven't found anything unusual in the surrounding logs, or in
metrics about the network and disk activity of the brokers.

Some similar-looking issues from jira:
https://issues.apache.org/jira/browse/KAFKA-4003
    similar in that the expand happens within a second or two of the shrink
https://issues.apache.org/jira/browse/KAFKA-4674
https://issues.apache.org/jira/browse/KAFKA-3916
    both of these involve disconnections, which we don't see
https://issues.apache.org/jira/browse/KAFKA-7152
    talks about a constant churn of shrink/expand, which we don't see
(also, was fixed in 2.1.0, and we see it on 2.2.0)

Re: ISR briefly shrinks then expands

Posted by Steven Taschuk <st...@stripe.com.INVALID>.
gc pauses (in milliseconds) for all brokers in the cluster for two minutes
around one recent episode:
min  max  count
 7     8    51
 6     8    49
 6     9    53  <-- broker removed from isr
 6     9    55
 7     9    54
 7    10    52
 6    11    56
 7    11    54
 6    12    58
 7    14    52
10    18    53  <-- leader of affected partitions
12    24    56

So, all the pauses are brief (low tens of ms), and I wouldn't say the
affected brokers are outliers wrt the unaffected brokers.



On Fri, May 3, 2019 at 12:20 AM Yz Xyz <or...@gmail.com> wrote:

>
> Hi Steven -  what are  pause times, gc and memory usage on Kafka brokers?
> When pause times are high kafka broker disconnects from zk. goes out of
> cluster and rejoins later makes shrink and expand on isr.
>
>
> > On May 2, 2019, at 5:39 PM, Steven Taschuk <st...@stripe.com.invalid>
> wrote:
> >
> > Our kafka broker logs show episodes where several partitions have
> > "Shrinking ISR ..." messages followed, usually less than 2 seconds
> > later, by corresponding "Expanding ISR ..." messages that restore the
> > original set of brokers for all the partitions.  Does anyone have any
> > suggestions on how to investigate this?
> >
> > On our main cluster, consisting of 12 brokers running kafka 2.0.1,
> > with 2500-3000 partitions in 300-400 topics (each partition having 3
> > replicas), we see about 10 such episodes a day, each involving
> > typically 5-8 partitions.  The set of partitions varies but often
> > repeats or nearly repeats; to illustrate that, here's the partitions
> > affected in one broker's episodes over the past week:
> >    2019-04-27T17:51
> >
> status-4,dining-11,tax-0,education-0,government-7,locker-19,credit-11,law-15
> >    2019-04-27T18:01
> family-14,health-13,golf-2,phone-14,news-15,peace-13
> >    2019-04-27T22:59
> > stock-13,income-2,insurance-0,college-0,district-1,breast-8,back-41
> >    2019-04-28T02:35    mg-12,executive-18,nursing-12
> >    2019-04-28T02:51    mg-12,executive-18,nursing-12
> >    2019-04-28T08:34
> > health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
> >    2019-04-28T12:15
> > health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
> >    2019-04-28T19:03
> > health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
> >    2019-04-28T20:16
> > health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
> >    2019-04-29T18:16
> >
> climate-11,lemon-12,faculty-22,side-4,music-12,police-0,room-0,press-8,parking-3,subject-10,blood-3
> >    2019-04-30T22:54    living-10,death-15,drug-3,talk-12,cell-12
> >    2019-05-01T16:52
> > oil-4,community-5,ice-7,public-2,substance-0,grocery-9,carbon-27,g-12
> >    2019-05-01T17:01
> > child-16,community-5,ice-7,public-2,grocery-9,carbon-27,task-3,g-12
> >    2019-05-01T17:36
> community-5,ice-7,public-2,grocery-9,carbon-27,g-12
> >    2019-05-01T22:14
> > school-9,interest-2,kitchen-0,hotel-3,carbon-3,heart-14
> > (I've replaced the actual topic names with common words.)
> >
> > We see similar behaviour on another cluster, consisting of 7 brokers
> > running kafka 2.2.0.
> >
> > We haven't found anything unusual in the surrounding logs, or in
> > metrics about the network and disk activity of the brokers.
> >
> > Some similar-looking issues from jira:
> > https://issues.apache.org/jira/browse/KAFKA-4003
> >    similar in that the expand happens within a second or two of the
> shrink
> > https://issues.apache.org/jira/browse/KAFKA-4674
> > https://issues.apache.org/jira/browse/KAFKA-3916
> >    both of these involve disconnections, which we don't see
> > https://issues.apache.org/jira/browse/KAFKA-7152
> >    talks about a constant churn of shrink/expand, which we don't see
> > (also, was fixed in 2.1.0, and we see it on 2.2.0)
>

Re: ISR briefly shrinks then expands

Posted by Yz Xyz <or...@gmail.com>.
Hi Steven -  what are  pause times, gc and memory usage on Kafka brokers?  When pause times are high kafka broker disconnects from zk. goes out of cluster and rejoins later makes shrink and expand on isr.


> On May 2, 2019, at 5:39 PM, Steven Taschuk <st...@stripe.com.invalid> wrote:
> 
> Our kafka broker logs show episodes where several partitions have
> "Shrinking ISR ..." messages followed, usually less than 2 seconds
> later, by corresponding "Expanding ISR ..." messages that restore the
> original set of brokers for all the partitions.  Does anyone have any
> suggestions on how to investigate this?
> 
> On our main cluster, consisting of 12 brokers running kafka 2.0.1,
> with 2500-3000 partitions in 300-400 topics (each partition having 3
> replicas), we see about 10 such episodes a day, each involving
> typically 5-8 partitions.  The set of partitions varies but often
> repeats or nearly repeats; to illustrate that, here's the partitions
> affected in one broker's episodes over the past week:
>    2019-04-27T17:51
> status-4,dining-11,tax-0,education-0,government-7,locker-19,credit-11,law-15
>    2019-04-27T18:01    family-14,health-13,golf-2,phone-14,news-15,peace-13
>    2019-04-27T22:59
> stock-13,income-2,insurance-0,college-0,district-1,breast-8,back-41
>    2019-04-28T02:35    mg-12,executive-18,nursing-12
>    2019-04-28T02:51    mg-12,executive-18,nursing-12
>    2019-04-28T08:34
> health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
>    2019-04-28T12:15
> health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
>    2019-04-28T19:03
> health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
>    2019-04-28T20:16
> health-25,living-10,supra-8,death-15,drug-3,talk-12,cell-0
>    2019-04-29T18:16
> climate-11,lemon-12,faculty-22,side-4,music-12,police-0,room-0,press-8,parking-3,subject-10,blood-3
>    2019-04-30T22:54    living-10,death-15,drug-3,talk-12,cell-12
>    2019-05-01T16:52
> oil-4,community-5,ice-7,public-2,substance-0,grocery-9,carbon-27,g-12
>    2019-05-01T17:01
> child-16,community-5,ice-7,public-2,grocery-9,carbon-27,task-3,g-12
>    2019-05-01T17:36    community-5,ice-7,public-2,grocery-9,carbon-27,g-12
>    2019-05-01T22:14
> school-9,interest-2,kitchen-0,hotel-3,carbon-3,heart-14
> (I've replaced the actual topic names with common words.)
> 
> We see similar behaviour on another cluster, consisting of 7 brokers
> running kafka 2.2.0.
> 
> We haven't found anything unusual in the surrounding logs, or in
> metrics about the network and disk activity of the brokers.
> 
> Some similar-looking issues from jira:
> https://issues.apache.org/jira/browse/KAFKA-4003
>    similar in that the expand happens within a second or two of the shrink
> https://issues.apache.org/jira/browse/KAFKA-4674
> https://issues.apache.org/jira/browse/KAFKA-3916
>    both of these involve disconnections, which we don't see
> https://issues.apache.org/jira/browse/KAFKA-7152
>    talks about a constant churn of shrink/expand, which we don't see
> (also, was fixed in 2.1.0, and we see it on 2.2.0)