Posted to users@kafka.apache.org by noah <ia...@gmail.com> on 2015/09/24 23:32:22 UTC

Frequent Consumer and Producer Disconnects

We are having issues with producers and consumers frequently fully
disconnecting (from both the brokers and ZK) and reconnecting without any
apparent cause. On our production systems it can happen anywhere from every
10-15 seconds to 15-20 minutes. On our less beefy test systems and
developer laptops, it can happen almost constantly.

We see no errors in the logs (sample attached), just a message for each of
our consumers and producers disconnecting, then reconnecting. On the
systems where it happens constantly, the consumers are not making any
progress.

The logs on the brokers are equally unhelpful; they show only frequent
connects and reconnects, without any apparent cause.

What could be causing this behavior?

Re: Frequent Consumer and Producer Disconnects

Posted by Todd Palino <tp...@gmail.com>.
Topic creation should only cause a rebalance for wildcard consumers (and I
believe that is regardless of whether or not the wildcard covers the topic
- once the ZK watch fires a rebalance is going to happen).

Back to the original concern, it would be helpful to see more of that log,
in that case. When a rebalance is triggered, there will be a log message
that will indicate why. This is going to be caused by a change in the group
membership (which has a number of causes, but at least it narrows it down)
or a topic change. Figuring out why the consumers are rebalancing is the
first step to trying to reduce it.

-Todd


On Saturday, September 26, 2015, noah <ia...@gmail.com> wrote:

> Thanks, that gives us some more to look at.
>
> That is unfortunately a small section of the log file. When we hit this
> problem (which is not every time), it will continue like that for hours.
>
> We also still have developers creating topics semi-regularly, which, it
> seems, can cause the high-level consumer to disconnect?

Re: Frequent Consumer and Producer Disconnects

Posted by noah <ia...@gmail.com>.
Thanks, that gives us some more to look at.

That is unfortunately a small section of the log file. When we hit this
problem (which is not every time), it will continue like that for hours.

We also still have developers creating topics semi-regularly, which, it
seems, can cause the high-level consumer to disconnect?


On Fri, Sep 25, 2015 at 6:16 PM Todd Palino <tp...@gmail.com> wrote:

> That rebalance cycle doesn't look endless. I see that you started 23
> consumers, and I see 23 rebalances finishing successfully, which is
> correct. You will see rebalance messages from all of the consumers you
> started. It all happens within about 2 seconds, which is fine. I agree that
> there are a lot of log messages, but I'm not seeing anything that is
> particularly a problem here. After the segment of log you provided, your
> consumers will be running properly. Now, given you have a topic with 16
> partitions, and you're running 23 consumers, 7 of those consumer threads
> are going to be idle because they do not own partitions.
>
> -Todd

Re: Frequent Consumer and Producer Disconnects

Posted by Todd Palino <tp...@gmail.com>.
That rebalance cycle doesn't look endless. I see that you started 23
consumers, and I see 23 rebalances finishing successfully, which is
correct. You will see rebalance messages from all of the consumers you
started. It all happens within about 2 seconds, which is fine. I agree that
there are a lot of log messages, but I'm not seeing anything that is
particularly a problem here. After the segment of log you provided, your
consumers will be running properly. Now, given you have a topic with 16
partitions, and you're running 23 consumers, 7 of those consumer threads
are going to be idle because they do not own partitions.

-Todd
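
To make the partition arithmetic above concrete, here is a minimal sketch of
a range-style assignment. It is a simplified illustration, not the consumer's
actual code: the function name and the thread names are hypothetical, and
only the counts (16 partitions, 23 consumer threads) come from the thread.

```python
def range_assign(num_partitions, consumers):
    """Assign partitions to consumers in contiguous ranges.

    Consumers are sorted by name. Each gets num_partitions // n partitions,
    and the first num_partitions % n consumers get one extra. When there are
    more consumers than partitions, the surplus consumers get an empty list.
    """
    ordered = sorted(consumers)
    base, extra = divmod(num_partitions, len(ordered))
    assignment = {}
    start = 0
    for index, consumer in enumerate(ordered):
        count = base + (1 if index < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment


# Hypothetical thread names; the counts are the ones quoted in the thread.
threads = ["consumer-%02d" % i for i in range(23)]
assignment = range_assign(16, threads)
idle = [name for name, parts in assignment.items() if not parts]
print(len(idle))  # 7
```

With fewer partitions than consumer threads, adding threads only adds idle
ones; each partition is still owned by exactly one thread in the group.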


On Fri, Sep 25, 2015 at 3:27 PM, noah <ia...@gmail.com> wrote:

> We're seeing this the most on developer machines that are starting up
> multiple high-level consumers on the same topic+group as part of service
> startup. The consumers do not seem to get a chance to consume anything
> before they disconnect.
>
> These are developer topics, so it is possible/likely that there isn't
> anything for them to consume in the topic, but the same service will start
> producing, so I would expect them to not be idle for long.
>
> Could it be the way we are bringing up multiple consumers at the same time is
> hitting some sort of endless rebalance cycle? And/or the resulting
> thrashing is causing them to time out, rebalance, etc.?
>
> I've tried attaching the logs again. Thanks!

Re: Frequent Consumer and Producer Disconnects

Posted by noah <ia...@gmail.com>.
We're seeing this the most on developer machines that are starting up
multiple high-level consumers on the same topic+group as part of service
startup. The consumers do not seem to get a chance to consume anything
before they disconnect.

These are developer topics, so it is possible/likely that there isn't
anything for them to consume in the topic, but the same service will start
producing, so I would expect them to not be idle for long.

Could it be the way we are bringing up multiple consumers at the same time is
hitting some sort of endless rebalance cycle? And/or the resulting
thrashing is causing them to time out, rebalance, etc.?

I've tried attaching the logs again. Thanks!

On Fri, Sep 25, 2015 at 3:33 PM Todd Palino <tp...@gmail.com> wrote:

> I don't see the logs attached, but what does the GC look like in your
> applications? A lot of times this is caused (at least on the consumer side)
> by the ZooKeeper session expiring due to excessive GC activity, which
> causes the consumers to go into a rebalance and change up their
> connections.
>
> -Todd

Re: Frequent Consumer and Producer Disconnects

Posted by Todd Palino <tp...@gmail.com>.
I don't see the logs attached, but what does the GC look like in your
applications? A lot of times this is caused (at least on the consumer side)
by the ZooKeeper session expiring due to excessive GC activity, which
causes the consumers to go into a rebalance and change up their connections.

-Todd
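
As a rough way to check the GC theory, one can compare observed GC pause
lengths against the consumer's ZooKeeper session timeout. The sketch below
is only illustrative: the pause values are made up, the function name is
hypothetical, and the 6000 ms timeout is an assumed setting for
zookeeper.session.timeout.ms that should be replaced with the value from
your own consumer config.

```python
def pauses_risking_expiry(gc_pauses_ms, session_timeout_ms):
    """Return the GC pauses long enough to outlast the session timeout.

    A paused JVM cannot send ZooKeeper heartbeats, so a pause at least as
    long as the session timeout is a likely cause of session expiry and
    therefore of a consumer rebalance.
    """
    return [pause for pause in gc_pauses_ms if pause >= session_timeout_ms]


# Assumed timeout; check zookeeper.session.timeout.ms in your consumer config.
SESSION_TIMEOUT_MS = 6000

# Made-up pause durations (milliseconds), e.g. collected from GC logs.
sample_pauses_ms = [120, 850, 7200, 300, 6500]

risky = pauses_risking_expiry(sample_pauses_ms, SESSION_TIMEOUT_MS)
print(risky)  # [7200, 6500]
```

Pauses somewhat shorter than the timeout can still be a problem when they
coincide with a delayed heartbeat, so treat the comparison as a first filter
rather than a proof.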


On Fri, Sep 25, 2015 at 1:25 PM, Gwen Shapira <gw...@confluent.io> wrote:

> How busy are the clients?
>
> The brokers occasionally close idle connections; this is normal and
> typically not something to worry about.
> However, this shouldn't happen to consumers that are actively reading data.
>
> I'm wondering if the "consumers not making any progress" could be due to a
> different issue, and because they are idle, the connection closes (vs the
> other way around).

Re: Frequent Consumer and Producer Disconnects

Posted by Gwen Shapira <gw...@confluent.io>.
How busy are the clients?

The brokers occasionally close idle connections; this is normal and
typically not something to worry about.
However, this shouldn't happen to consumers that are actively reading data.

I'm wondering if the "consumers not making any progress" could be due to a
different issue, and because they are idle, the connection closes (vs the
other way around).
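
One quick sanity check on the idle-connection explanation is to compare the
reported disconnect intervals with the broker's idle timeout. This is a
sketch under stated assumptions: the 10-minute value is an assumed setting
for the broker's connections.max.idle.ms, the function name is hypothetical,
and the intervals are the extremes reported in the original post.

```python
# Assumed broker setting for connections.max.idle.ms (10 minutes).
IDLE_TIMEOUT_MS = 10 * 60 * 1000


def explained_by_idle_close(interval_ms, idle_timeout_ms=IDLE_TIMEOUT_MS):
    """Return True if a disconnect interval is at least the idle timeout.

    A connection cannot be closed for idleness before it has been idle for
    the full timeout, so shorter intervals need another explanation.
    """
    return interval_ms >= idle_timeout_ms


# Extremes of the intervals reported for the production systems.
observed_ms = {"15 seconds": 15 * 1000, "20 minutes": 20 * 60 * 1000}
results = {label: explained_by_idle_close(ms) for label, ms in observed_ms.items()}
print(results)  # {'15 seconds': False, '20 minutes': True}
```

Under that assumed timeout, disconnects every 10-15 seconds would not be
idle closes, while the 15-20 minute ones could be.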

On Thu, Sep 24, 2015 at 2:32 PM, noah <ia...@gmail.com> wrote:

> We are having issues with producers and consumers frequently fully
> disconnecting (from both the brokers and ZK) and reconnecting without any
> apparent cause. On our production systems it can happen anywhere from every
> 10-15 seconds to 15-20 minutes. On our less beefy test systems and
> developer laptops, it can happen almost constantly.
>
> We see no errors in the logs (sample attached), just a message for each of
> our consumers and producers disconnecting, then reconnecting. On the
> systems where it happens constantly, the consumers are not making any
> progress.
>
> The logs on the brokers are equally unhelpful; they show only frequent
> connects and reconnects, without any apparent cause.
>
> What could be causing this behavior?
>
>