Posted to dev@zookeeper.apache.org by Li Wang <li...@gmail.com> on 2023/02/08 02:49:27 UTC

ZOOKEEPER-4306 CloseSessionTxn contains too many ephemal nodes cause cluster crash

Hello,


We had a production outage due to the issue reported in
https://issues.apache.org/jira/browse/ZOOKEEPER-4306 and some other users
also ran into the same issue. I wonder if we can use this thread to discuss
and come to a consensus on how to fix it. :-)



Thanks to Damien Diederen
<https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ztzg> for the
contribution and patch. Limiting the number of ephemeral nodes that can be
created in a session looks like a simple and reasonable solution to me.
Having a way to enforce the limit will protect the system from potential OOM
issues.


I've also looked into the possibility of splitting CloseSessionTxn into
smaller ones. Unfortunately, it didn't work, because currently in ZooKeeper one
request can only have one txn. Even though we can split the paths to be
deleted into multiple batches and define a sub-txn for each batch, we have to
wrap all sub-txns into a single wrapper txn and associate it with the
request. In the end, when loading the ZK database, we still have to deserialize
the large wrapper txn, which can fail the length check (jute.maxbuffer +
zookeeper.jute.maxbuffer.extrasize).
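
To make the size argument concrete, here is a minimal sketch that only
estimates serialized sizes. It is not ZooKeeper code: the class and method
names are hypothetical, the encoding (a 4-byte length prefix per string and
per vector) is a simplification, and the two limits are assumed defaults that
may differ in a given deployment.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustration only: estimates serialized sizes; not ZooKeeper code. */
final class WrapperTxnSizeSketch {
    // Assumed values for jute.maxbuffer and zookeeper.jute.maxbuffer.extrasize.
    static final int MAX_BUFFER = 0xfffff;
    static final int EXTRA_SIZE = 1024;

    /** Bytes for a length-prefixed vector of length-prefixed UTF-8 strings. */
    static long serializedSize(List<String> paths) {
        long size = 4; // vector length prefix
        for (String p : paths) {
            size += 4 + p.getBytes(StandardCharsets.UTF_8).length;
        }
        return size;
    }

    /** Splits the paths into batches of at most batchSize entries. */
    static List<List<String>> split(List<String> paths, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < paths.size(); i += batchSize) {
            batches.add(paths.subList(i, Math.min(i + batchSize, paths.size())));
        }
        return batches;
    }

    /** Size of one wrapper record that embeds every batch as a sub-txn. */
    static long wrapperSize(List<List<String>> batches) {
        long size = 4; // number of sub-txns
        for (List<String> batch : batches) {
            size += 4 + serializedSize(batch); // length-prefixed sub-txn
        }
        return size;
    }

    /** Mirrors the idea of the length check: reject anything above the sum. */
    static boolean passesLengthCheck(long size) {
        return size <= (long) MAX_BUFFER + EXTRA_SIZE;
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {
            paths.add("/app/workers/ephemeral-" + i);
        }
        List<List<String>> batches = split(paths, 1_000);
        System.out.println("one batch fits: "
                + passesLengthCheck(serializedSize(batches.get(0))));
        System.out.println("wrapper fits:   "
                + passesLengthCheck(wrapperSize(batches)));
    }
}
```

Each batch is small on its own, but the wrapper is at least as large as the
unsplit path list, so batching alone does not get it under the limit.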


Changing ZK to allow multiple txns for a single request looks quite
involved and it may have other implications.


I wonder if anyone has any input or any better ideas?



Thanks,


Li

Re: ZOOKEEPER-4306 CloseSessionTxn contains too many ephemal nodes cause cluster crash

Posted by Li Wang <li...@gmail.com>.

Hello,

I would like to follow up and ask for input on this. Damien, as the
author of the PR, do you have any thoughts?

Please let me know if there is anything I can do to help move this forward.

Cheers,

Li

On Wed, Feb 8, 2023 at 1:08 PM Li Wang <li...@gmail.com> wrote:

> Thanks for the inputs, Enrico.
>
> On Wed, Feb 8, 2023 at 12:26 AM Enrico Olivelli <eo...@gmail.com>
> wrote:
>
>> Li,
>>
>> On Wed, Feb 8, 2023 at 03:49 Li Wang <li...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> >
>> > We had a production outage due to the issue reported in
>> > https://issues.apache.org/jira/browse/ZOOKEEPER-4306 and some other
>> users
>> > also ran into the same issue. I wonder if we can use this thread to
>> discuss
>> > and come to a consensus on how to fix it. :-)
>> >
>> >
>> >
>> > Thanks Damien Diederen
>> > <https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ztzg> for
>> the
>> > contribution and patch. Limiting the number of ephemeral nodes that can
>> be
>> > created in a session looks like a simple and reasonable solution to me.
>> > Having a way to enforce it will protect the system from potential OOM
>> > issues.
>>
>> How does the client recover from having created too many ephemeral nodes ?
>> This seems not trivial to do. Let me share some ideas:
>>
>
> A new KeeperException/error code
> (i.e.TooManyEphemeralsException/TOOMANYEPHEMERALS) is introduced in the
> patch. Do you mean how
> the old clients handle the new error code?
>
>>
>> Solution one: fail the creation of the node
>> If we fail the creation of the node then the application will probably
>> enter a loop and continue to create it.
>> There is no way to say that some znode is "more important" than other
>> znodes, so the application will keep failing in the creation
>> of random znodes.
>>
>
> How about having a property to control whether throws
> TooManyEphemeralsException in this case? Admin can enable the property
> after all client applications upgrade to the new version and handle the new
> error code.
>
>
>> Solution two: force expires the session (and reset ephemeral nodes)
>> In this case some applications would probably recover in a better way
>> (ZK client applications are supposed to deal with session expiration
>> somehow).
>> and some applications will auto-restart (because session expired is a
>> symptom of network partition and suicide is the best thing to do)
>> In any case the application will try to create the znodes, work for
>> some time, and then die again (or recreate the session)
>>
>
> Great idea! Forcing session expiration seems promising, as it addresses
> both following.
>
> 1.  Protecting the server from txn size getting overflowed
> 2.  No need to worry about backward compatibility issue, as we use an
> existing error code and client application are supposed to handle session
> expiration error
>
>
>> I agree that a short term solution is a server side protection, but it
>> is better to think to a better plan.
>
>
> Totally agree. We need to think through and have a plan on how the client
> apps handle the changes.
> The Solution two seems better, as it is less intrusive and doesn't require
> any client side change. WDYT?
>
> Anyone else have any inputs?
>
>>
>> >
>> >
>> > I've also looked into the possibility of splitting CloseSessionTxn into
>> > smaller ones. Unfortunately, it didn't work, as currently in Zookeeper,
>> one
>> > request can only have one txn. Even though we can split the paths to be
>> > deleted into multiple batches and define sub-txn for each batch, we
>> have to
>> > wrap all sub-txn(s) into a single wrapper txn and associate it to the
>> > request. At the end, when loading zk database, we still have to
>> deserialize
>> > the large wrapper txn, which can fail the length check (jute.maxBuffer +
>> > zookeeper.jute.maxbuffer.extrasize).
>>
>> Unfortunately there are few users that say that zookeeper doesn't
>> scale and probably here we are hitting one of such cases,
>> and most of these cases are due to the write protocol (JUTE), that
>> puts unneeded constraints on Zookeeper
>>
>
> Yes, in this case, we hit the constraint that JUTE doesn't serialize the
> individual sub-txns separately.
>
> Best,
>
> Li
>
>> Enrico
>>
>> >
>> >
>> > Changing ZK to allow multiple txns for a single request looks quite
>> > involved and it may have other implications.
>> >
>> >
>> > I wonder if anyone has any input or any better ideas?
>> >
>> >
>> >
>> > Thanks,
>> >
>> >
>> > Li
>>
>

Re: ZOOKEEPER-4306 CloseSessionTxn contains too many ephemal nodes cause cluster crash

Posted by Li Wang <li...@gmail.com>.

Thanks for the input, Enrico.

On Wed, Feb 8, 2023 at 12:26 AM Enrico Olivelli <eo...@gmail.com> wrote:

> Li,
>
> On Wed, Feb 8, 2023 at 03:49 Li Wang <li...@gmail.com> wrote:
> >
> > Hello,
> >
> >
> > We had a production outage due to the issue reported in
> > https://issues.apache.org/jira/browse/ZOOKEEPER-4306 and some other
> users
> > also ran into the same issue. I wonder if we can use this thread to
> discuss
> > and come to a consensus on how to fix it. :-)
> >
> >
> >
> > Thanks Damien Diederen
> > <https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ztzg> for
> the
> > contribution and patch. Limiting the number of ephemeral nodes that can
> be
> > created in a session looks like a simple and reasonable solution to me.
> > Having a way to enforce it will protect the system from potential OOM
> > issues.
>
> How does the client recover from having created too many ephemeral nodes ?
> This seems not trivial to do. Let me share some ideas:
>

A new KeeperException/error code
(i.e. TooManyEphemeralsException/TOOMANYEPHEMERALS) is introduced in the
patch. Do you mean how old clients would handle the new error code?

>
> Solution one: fail the creation of the node
> If we fail the creation of the node then the application will probably
> enter a loop and continue to create it.
> There is no way to say that some znode is "more important" than other
> znodes, so the application will keep failing in the creation
> of random znodes.
>

How about having a property to control whether the server throws
TooManyEphemeralsException in this case? Admins can enable the property
after all client applications have upgraded to the new version and can handle
the new error code.
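
A rough sketch of that idea, under loud assumptions: the class below is a
standalone model, not the patch. The exception is a hypothetical stand-in for
the one named in the patch, and the limit and the on/off switch are plain
constructor arguments rather than real ZooKeeper properties.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustration only: a per-session ephemeral cap guarded by a switch. */
final class EphemeralLimitSketch {
    /** Hypothetical stand-in for the exception named in the patch. */
    static final class TooManyEphemeralsException extends Exception {
        TooManyEphemeralsException(long sessionId, int limit) {
            super("session 0x" + Long.toHexString(sessionId)
                    + " already owns " + limit + " ephemeral nodes");
        }
    }

    private final int maxEphemeralsPerSession;
    private final boolean enforceLimit; // hypothetical admin-controlled switch
    private final Map<Long, Set<String>> ephemeralsBySession = new HashMap<>();

    EphemeralLimitSketch(int maxEphemeralsPerSession, boolean enforceLimit) {
        this.maxEphemeralsPerSession = maxEphemeralsPerSession;
        this.enforceLimit = enforceLimit;
    }

    /** Tracks a new ephemeral path, rejecting it only when enforcement is on. */
    void createEphemeral(long sessionId, String path)
            throws TooManyEphemeralsException {
        Set<String> owned =
                ephemeralsBySession.computeIfAbsent(sessionId, id -> new HashSet<>());
        if (enforceLimit
                && !owned.contains(path)
                && owned.size() >= maxEphemeralsPerSession) {
            throw new TooManyEphemeralsException(sessionId, maxEphemeralsPerSession);
        }
        owned.add(path);
    }

    /** Number of ephemeral paths currently tracked for the session. */
    int ephemeralCount(long sessionId) {
        Set<String> owned = ephemeralsBySession.get(sessionId);
        return owned == null ? 0 : owned.size();
    }
}
```

With the switch off the behavior is unchanged, which is what lets an admin
defer enforcement until every client understands the new error code.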


> Solution two: force expires the session (and reset ephemeral nodes)
> In this case some applications would probably recover in a better way
> (ZK client applications are supposed to deal with session expiration
> somehow).
> and some applications will auto-restart (because session expired is a
> symptom of network partition and suicide is the best thing to do)
> In any case the application will try to create the znodes, work for
> some time, and then die again (or recreate the session)
>

Great idea! Forcing session expiration seems promising, as it addresses
both of the following:

1.  It protects the server from an oversized txn.
2.  There is no need to worry about backward compatibility, as we use an
existing error code and client applications are supposed to handle the session
expiration error.
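
As a sketch of how such a policy could behave (a standalone model with
hypothetical names, not the server's actual session handling): once a session
tries to go past the limit, the server expires it, drops its ephemerals, and
answers with the result that clients already know.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustration only: expire a session that exceeds its ephemeral limit. */
final class ForceExpireSketch {
    /** Stand-ins for an OK reply and the existing session-expired error. */
    enum Result { OK, SESSION_EXPIRED }

    private final int limit; // hypothetical per-session cap
    private final Map<Long, List<String>> ephemerals = new HashMap<>();
    private final Set<Long> expiredSessions = new HashSet<>();

    ForceExpireSketch(int limit) {
        this.limit = limit;
    }

    Result createEphemeral(long sessionId, String path) {
        if (expiredSessions.contains(sessionId)) {
            return Result.SESSION_EXPIRED;
        }
        List<String> owned =
                ephemerals.computeIfAbsent(sessionId, id -> new ArrayList<>());
        if (owned.size() >= limit) {
            // The close-session cleanup covers at most `limit` paths here,
            // so its size stays bounded.
            ephemerals.remove(sessionId);
            expiredSessions.add(sessionId);
            return Result.SESSION_EXPIRED;
        }
        owned.add(path);
        return Result.OK;
    }

    /** Number of ephemeral paths currently tracked for the session. */
    int ephemeralCount(long sessionId) {
        List<String> owned = ephemerals.get(sessionId);
        return owned == null ? 0 : owned.size();
    }
}
```

The client needs no new code path in this model: it sees the same expiration
result it must already handle, and starts over with a new session.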


> I agree that a short term solution is a server side protection, but it
> is better to think to a better plan.


Totally agree. We need to think this through and have a plan for how the client
apps handle the changes.
Solution two seems better, as it is less intrusive and doesn't require
any client-side change. WDYT?

Does anyone else have any input?

>
> >
> >
> > I've also looked into the possibility of splitting CloseSessionTxn into
> > smaller ones. Unfortunately, it didn't work, as currently in Zookeeper,
> one
> > request can only have one txn. Even though we can split the paths to be
> > deleted into multiple batches and define sub-txn for each batch, we have
> to
> > wrap all sub-txn(s) into a single wrapper txn and associate it to the
> > request. At the end, when loading zk database, we still have to
> deserialize
> > the large wrapper txn, which can fail the length check (jute.maxBuffer +
> > zookeeper.jute.maxbuffer.extrasize).
>
> Unfortunately there are few users that say that zookeeper doesn't
> scale and probably here we are hitting one of such cases,
> and most of these cases are due to the write protocol (JUTE), that
> puts unneeded constraints on Zookeeper
>

Yes, in this case, we hit the constraint that JUTE doesn't serialize the
individual sub-txns separately.

Best,

Li

> Enrico
>
> >
> >
> > Changing ZK to allow multiple txns for a single request looks quite
> > involved and it may have other implications.
> >
> >
> > I wonder if anyone has any input or any better ideas?
> >
> >
> >
> > Thanks,
> >
> >
> > Li
>

Re: ZOOKEEPER-4306 CloseSessionTxn contains too many ephemal nodes cause cluster crash

Posted by Enrico Olivelli <eo...@gmail.com>.

Li,

On Wed, Feb 8, 2023 at 03:49 Li Wang <li...@gmail.com> wrote:
>
> Hello,
>
>
> We had a production outage due to the issue reported in
> https://issues.apache.org/jira/browse/ZOOKEEPER-4306 and some other users
> also ran into the same issue. I wonder if we can use this thread to discuss
> and come to a consensus on how to fix it. :-)
>
>
>
> Thanks Damien Diederen
> <https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ztzg> for the
> contribution and patch. Limiting the number of ephemeral nodes that can be
> created in a session looks like a simple and reasonable solution to me.
> Having a way to enforce it will protect the system from potential OOM
> issues.

How does the client recover from having created too many ephemeral nodes?
This does not seem trivial to do. Let me share some ideas:

Solution one: fail the creation of the node
If we fail the creation of the node, then the application will probably
enter a loop and keep trying to create it.
There is no way to say that some znode is "more important" than other
znodes, so the application will keep failing to create
random znodes.

Solution two: force-expire the session (and reset its ephemeral nodes)
In this case some applications would probably recover in a better way
(ZK client applications are supposed to deal with session expiration
somehow),
and some applications will auto-restart (because session expiration is a
symptom of a network partition and terminating is the best thing to do).
In any case the application will try to create the znodes, work for
some time, and then die again (or recreate the session).

I agree that a short-term solution is server-side protection, but we
should also think about a better long-term plan.


>
>
> I've also looked into the possibility of splitting CloseSessionTxn into
> smaller ones. Unfortunately, it didn't work, as currently in Zookeeper, one
> request can only have one txn. Even though we can split the paths to be
> deleted into multiple batches and define sub-txn for each batch, we have to
> wrap all sub-txn(s) into a single wrapper txn and associate it to the
> request. At the end, when loading zk database, we still have to deserialize
> the large wrapper txn, which can fail the length check (jute.maxBuffer +
> zookeeper.jute.maxbuffer.extrasize).


Unfortunately, there are a few users who say that ZooKeeper doesn't
scale, and here we are probably hitting one such case.
Most of these cases are due to the write protocol (JUTE), which
puts unneeded constraints on ZooKeeper.

Enrico

>
>
> Changing ZK to allow multiple txns for a single request looks quite
> involved and it may have other implications.
>
>
> I wonder if anyone has any input or any better ideas?
>
>
>
> Thanks,
>
>
> Li