Posted to user@zookeeper.apache.org by Mike Solomon <ms...@dropbox.com> on 2016/10/12 19:57:54 UTC

outstandingChanges queue grows without bound

I've been performance testing 3.5.2 and hit an interesting unavailability issue.

When the server is very busy (64k connections, 16k writes per
second), the leader can get busy enough that connections get throttled.
Enough throttling causes sessions to expire. As sessions expire, the
CPU consumption rises and the quorum is effectively unavailable.
Interestingly, if you shut down all the clients, the quorum won't heal
for nearly 10 minutes.

The issue is that the outstandingChanges queue has 250k items in it
and the closeSession code scans this linearly under a lock. Replacing
the linear scan with a hash table lookup improves this, but likely the
real solution is some backpressure on clients as a result of an
oversized outstandingChanges queue.
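To illustrate the idea (with simplified, made-up types, not the actual ZooKeeper classes), a secondary index keyed by session makes the close-session lookup proportional to that session's pending changes rather than the whole queue:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: ChangeQueue and ChangeRecord are illustrative names,
// not ZooKeeper's real classes. The point is the secondary index that
// replaces a linear scan of outstandingChanges with a map lookup.
class ChangeQueue {
    static final class ChangeRecord {
        final long sessionId;
        final String path;
        ChangeRecord(long sessionId, String path) {
            this.sessionId = sessionId;
            this.path = path;
        }
    }

    private final Deque<ChangeRecord> outstandingChanges = new ArrayDeque<>();
    // Secondary index: session id -> that session's pending changes.
    private final Map<Long, List<ChangeRecord>> changesBySession = new HashMap<>();

    synchronized void add(ChangeRecord rec) {
        outstandingChanges.addLast(rec);
        changesBySession
            .computeIfAbsent(rec.sessionId, k -> new ArrayList<>())
            .add(rec);
    }

    // O(k) where k is this session's pending changes, instead of O(n)
    // over the entire queue while holding the lock.
    synchronized List<ChangeRecord> changesForSession(long sessionId) {
        return changesBySession.getOrDefault(sessionId, List.of());
    }

    synchronized int size() {
        return outstandingChanges.size();
    }
}
```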

Here is a sample fix:
https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c

This results in the quorum healing about 30 seconds after the clients
disconnect.

Is there a way to prevent runaway growth in this queue? I'm wondering
if changing the definition of "throttling" to take into account the
size of this queue might help mitigate this. The end goal is that some
stable amount of traffic is reached asymptotically without suffering a
collapse.
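One possible shape for that throttle, purely as a sketch (QueueBackpressure and its water marks are hypothetical, not anything in ZooKeeper today): stop accepting new requests once the queue passes a high-water mark, and only resume once it drains below a low-water mark, so the server settles at a stable load instead of oscillating:

```java
// Hypothetical hysteresis throttle keyed on the pending-change queue
// size. Names and thresholds are illustrative only.
class QueueBackpressure {
    private final int highWater;
    private final int lowWater;
    private boolean throttled = false;

    QueueBackpressure(int highWater, int lowWater) {
        this.highWater = highWater;
        this.lowWater = lowWater;
    }

    // Called with the current outstandingChanges size; returns whether
    // the server should stop reading new client requests.
    synchronized boolean shouldThrottle(int outstandingChanges) {
        if (!throttled && outstandingChanges >= highWater) {
            throttled = true;   // queue too deep: start shedding load
        } else if (throttled && outstandingChanges <= lowWater) {
            throttled = false;  // queue drained: resume reads
        }
        return throttled;
    }
}
```

The gap between the two marks is what prevents the throttle from flapping as the queue hovers near a single threshold.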

Thanks,
-Mike

Re: outstandingChanges queue grows without bound

Posted by Mike Solomon <ms...@dropbox.com>.
I've pulled this into a separate branch after incorporating some feedback.

https://github.com/msolo/zookeeper/commits/msolo-optimize-close-session


Re: outstandingChanges queue grows without bound

Posted by Mike Solomon <ms...@dropbox.com>.
Thanks for the comments - I'll incorporate them in a future fix. There
is actually a flaw in this code as it's currently implemented - it
does not match the original behavior and I need to think more
carefully.

Arshad, I think ZOOKEEPER-2570 is a somewhat different issue.  The
root cause in both cases is that the ProcessRequestThread is
overloaded, but large multi-op transactions are probably a degenerate
case.


Re: outstandingChanges queue grows without bound

Posted by Edward Ribeiro <ed...@gmail.com>.
Very interesting patch, Mike.

I've left a couple of review comments (hope you don't mind) on the
https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c
commit. :)

Cheers,
Eddie



Re: outstandingChanges queue grows without bound

Posted by Arshad Mohammad <ar...@gmail.com>.
Hi Mike
I also faced the same issue. There is a test patch in ZOOKEEPER-2570
which can be used to quickly check the performance gains of each
modification. Hope it is useful.

-Arshad
