Posted to issues@zookeeper.apache.org by "Li Wang (Jira)" <ji...@apache.org> on 2023/02/08 02:09:00 UTC
[jira] [Comment Edited] (ZOOKEEPER-4306) CloseSessionTxn contains too many ephemal nodes cause cluster crash

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685644#comment-17685644 ] 

Li Wang edited comment on ZOOKEEPER-4306 at 2/8/23 2:08 AM:
------------------------------------------------------------

Thanks for the contribution, [~ztzg]! 

I've looked at the PR. Limiting the number of ephemeral nodes that can be created in a session looks like a reasonable solution to me. Having a way to enforce this will avoid potential OOM issues. Would it be possible to move this PR forward?
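To make the per-session limit idea concrete, here is a minimal sketch of what such a guard could look like. It is not the PR's implementation and not a ZooKeeper API. `EphemeralLimiter`, `tryCreate`, `onDelete` and the limit value are all hypothetical names. A real server would also need to rebuild these counts when loading the database and drop them when the session closes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: not ZooKeeper's API and not the PR's code.
// Tracks how many ephemeral nodes each session owns and refuses new
// creates once a configurable per-session limit is reached.
public class EphemeralLimitSketch {

    static final class EphemeralLimiter {
        private final int maxPerSession;                  // hypothetical limit
        private final Map<Long, Integer> counts = new HashMap<>();

        EphemeralLimiter(int maxPerSession) {
            this.maxPerSession = maxPerSession;
        }

        /** Records a new ephemeral node and returns true if the session is under its limit. */
        synchronized boolean tryCreate(long sessionId) {
            int current = counts.getOrDefault(sessionId, 0);
            if (current >= maxPerSession) {
                return false;                             // caller would reject the create request
            }
            counts.put(sessionId, current + 1);
            return true;
        }

        /** Called when an ephemeral node owned by the session is deleted. */
        synchronized void onDelete(long sessionId) {
            // Returning null from the remapping function removes the entry.
            counts.computeIfPresent(sessionId, (id, c) -> c > 1 ? c - 1 : null);
        }

        synchronized int count(long sessionId) {
            return counts.getOrDefault(sessionId, 0);
        }
    }

    public static void main(String[] args) {
        EphemeralLimiter limiter = new EphemeralLimiter(3);
        long session = 0x1234L;
        int accepted = 0;
        for (int i = 0; i < 5; i++) {
            if (limiter.tryCreate(session)) {
                accepted++;
            }
        }
        System.out.println("accepted=" + accepted);                          // accepted=3
        limiter.onDelete(session);
        System.out.println("canCreateAgain=" + limiter.tryCreate(session));  // canCreateAgain=true
    }
}
```

A count is only a proxy for the real constraint, which is the serialized byte size of the close-session txn. Paths of very different lengths make the same count mean very different sizes.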

I've also looked into the feasibility of splitting CloseSessionTxn into smaller ones. Unfortunately, it didn't work, because a request can only have one txn in ZooKeeper. Even though we can split the paths to be deleted into multiple batches and define a sub-txn for each batch, we still have to wrap all the sub-txns into a single wrapper txn and associate it with the request. In the end, when loading the ZK database, we still have to deserialize the large wrapper txn, which can fail the length check (jute.maxbuffer + zookeeper.jute.maxbuffer.extrasize).
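A rough size estimate shows why batching inside one wrapper txn does not help. The sketch assumes that jute encodes a list of paths as follows:

- a 4-byte element count;
- for each path, a 4-byte length prefix plus the path's UTF-8 bytes.

It ignores txn and log headers. The batch and wrapper layout, and every name below, are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Back-of-the-envelope sketch. Assumes a list of strings is serialized as a
// 4-byte count followed by (4-byte length + UTF-8 bytes) per string.
// Txn/record headers are ignored. All names here are hypothetical.
public class CloseSessionSizeSketch {

    /** Approximate serialized size of a flat list of paths. */
    static long vectorOfStringsSize(List<String> paths) {
        long size = 4; // element count
        for (String p : paths) {
            size += 4 + p.getBytes(StandardCharsets.UTF_8).length;
        }
        return size;
    }

    /**
     * Approximate size of a hypothetical wrapper txn holding the same paths
     * split into batches. Each batch is its own list, and the wrapper adds
     * one more 4-byte count for the number of batches.
     */
    static long wrappedBatchesSize(List<String> paths, int batchSize) {
        long size = 4; // number of batches in the wrapper
        for (int start = 0; start < paths.size(); start += batchSize) {
            int end = Math.min(start + batchSize, paths.size());
            size += vectorOfStringsSize(paths.subList(start, end));
        }
        return size;
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            paths.add("/app/locks/member-" + i); // hypothetical ephemeral paths
        }
        long limit = 2L * 0xfffff; // ~2 MB: jute.maxbuffer + extrasize, "1MB+1MB" as reported
        long flat = vectorOfStringsSize(paths);
        long wrapped = wrappedBatchesSize(paths, 1_000);
        System.out.println("flat=" + flat + " wrapped=" + wrapped + " limit=" + limit);
        System.out.println("flat exceeds limit: " + (flat > limit));
        // Splitting only adds per-batch overhead; the wrapper is never smaller.
        System.out.println("wrapped >= flat: " + (wrapped >= flat));
    }
}
```

Every path still ends up in the same serialized record, so batching only adds a few bytes of overhead per batch.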

Changing ZK to allow multiple txns for a single request seems quite involved, and it may have other implications. I wonder if anyone has better ideas or input.



was (Author: liwang):
Thanks for the contribution, [~ztzg]! 

I've looked at the PR. Limiting the number of ephemeral nodes that can be created in a session looks like a reasonable solution to me. Having a way to enforce this will avoid potential OOM issues. Can we get this moving forward?

I've also looked into the feasibility of splitting CloseSessionTxn into smaller ones. Unfortunately, it didn't work, as we can only have one txn for a given request. Even though we can split the paths to be deleted into multiple batches and define a sub-txn for each batch, we have to wrap all the sub-txns into a single wrapper txn and associate it with the request. In the end, when loading the database, we still have to deserialize the large wrapper txn, which can fail the length check (jute.maxbuffer + zookeeper.jute.maxbuffer.extrasize).







> CloseSessionTxn contains too many ephemal nodes cause cluster crash
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4306
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4306
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.2
>            Reporter: Lin Changrui
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: cs.jpg, f.jpg, l1.png, l2.jpg, r.jpg
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We ran a test to see how many ephemeral nodes a client can create under one parent node with the default configuration. The test eventually caused the cluster to crash, with exception stack traces like these:
> follower:
> !f.jpg!
> leader:
> !l1.png!
> !l2.jpg!
> It seems that the leader sent a txn packet that was too large to the followers. When a follower tried to deserialize the txn, it found that the txn length exceeded its buffer size (default 1 MB + 1 MB, jute.maxbuffer + jute.maxbuffer.extrasize). That caused the followers to crash. The leader then found that not enough followers were synced, so it shut down. During shutdown, the leader called zkDb.fastForwardDataBase() and found that the txn read from the txn log exceeded its buffer size, so it crashed too.
> After the servers crashed, they tried to restart the quorum, but they could not succeed because the last txn was too large. We lost the log from that moment, but the stack trace is the same as this one:
> !r.jpg|width=1468,height=598!
>  
> *Root Cause*
> We used org.apache.zookeeper.server.LogFormatter (with -Djute.maxbuffer=74827780) to visualize this log and found this: !cs.jpg|width=1400,height=581! So the CloseSessionTxn contains the absolute paths of all of the session's ephemeral nodes. We know we will get a large getChildren response if we create too many child nodes under one parent node, and that response is limited by the client's jute.maxbuffer. If we create plenty of ephemeral nodes under different parent nodes with one session, we may never exceed the client's buffer. However, when the session closes without deleting these nodes first, it is likely to cause a cluster crash.
> Is it a bug or just an unspecified feature? If so, how should we judge the upper limit on the number of nodes we can create?
>  
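On the question of judging the upper limit, here is a rough estimate under one encoding assumption: a 4-byte count, then a 4-byte length prefix plus UTF-8 bytes per path, with headers ignored. Under that assumption, the number of ephemeral paths one session can own is roughly (limit - 4) / (4 + average path length). The sketch also includes a hypothetical stand-in for the deserialization length check described in the report. It is not ZooKeeper's source. The ~2 MB limit comes from the reporter's "1 MB + 1 MB" figure.

```java
import java.io.IOException;

// Hypothetical sketch, not ZooKeeper's source. It models the behavior
// described in the report: a record whose length exceeds
// jute.maxbuffer + extrasize is rejected during deserialization.
public class TxnLengthCheckSketch {

    static void checkLength(int len, int maxBuffer, int extraSize) throws IOException {
        if (len < 0 || (long) len > (long) maxBuffer + extraSize) {
            throw new IOException("record length " + len + " exceeds limit");
        }
    }

    /**
     * Rough upper bound on how many paths of a given average UTF-8 length fit
     * in one close-session record, assuming a 4-byte count plus a 4-byte
     * length prefix per path (headers ignored).
     */
    static long maxPathsThatFit(long limitBytes, int avgPathBytes) {
        return (limitBytes - 4) / (4 + avgPathBytes);
    }

    public static void main(String[] args) {
        int maxBuffer = 0xfffff; // ~1 MB jute.maxbuffer
        int extra = 0xfffff;     // ~1 MB extrasize, as reported in the issue
        long limit = (long) maxBuffer + extra;
        System.out.println("max 50-byte paths: " + maxPathsThatFit(limit, 50)); // 38836
        try {
            checkLength(3_000_000, maxBuffer, extra); // hypothetical oversized txn
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Longer paths lower that bound proportionally. A safe per-session limit therefore depends on path length as well as on the node count.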



--
This message was sent by Atlassian Jira
(v8.20.10#820010)