
Posted to issues@hbase.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2019/03/18 19:20:00 UTC

[jira] [Commented] (HBASE-22057) Impose upper-bound on size of ZK ops sent in a single multi()

    [ https://issues.apache.org/jira/browse/HBASE-22057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795324#comment-16795324 ] 

Josh Elser commented on HBASE-22057:
------------------------------------

From https://zookeeper.apache.org/doc/r3.4.13/api/index.html

bq.  On success, a list of results is returned. On failure, an exception is raised which contains partial results and error details, see KeeperException.getResults() 

This indicates that we are changing the semantics via our wrapper in ZKUtil. However, we never inspect the results from {{ZKUtil.multiOrSequential}}, so I think the point is moot.

> Impose upper-bound on size of ZK ops sent in a single multi()
> -------------------------------------------------------------
>
>                 Key: HBASE-22057
>                 URL: https://issues.apache.org/jira/browse/HBASE-22057
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>             Fix For: 1.6.0, 2.2.0
>
>
> In {{ZKUtil#multiOrSequential}}, we accept a list of {{ZKUtilOp}}'s to pass down to the {{ZooKeeper#multi(Iterable<Op>)}} method.
> One problem with this approach is that we may generate a large list of ZNodes to mutate in one batch, which can exceed the allowable client packet length, specified by {{jute.maxbuffer}}.
> This problem can manifest when we have a large number of WALs to replicate, queued in ZooKeeper, from a disabled peer. When that peer is dropped, the RS submits deletes of those queued WALs. The RS will see ConnectionLoss for the resulting {{multi()}} calls it tries to make, because the client message is too large (we're trying to delete too many WALs at once). The result (at least in branch-1 ish versions) is that the RS aborts after exceeding the ZK retries (as this operation will never succeed).
> A simple fix would be to impose a maximum number of Ops to run in a single batch inside ZKUtil, and split apart the caller-submitted batch into smaller chunks. Before we make such a change, I do need to make sure that we don't have any expectations on atomicity of the operations. I'm not sure what ZK provides here -- for the above example, splitting up batches of deletes is not an issue, but there could be issues with batches of creates where we only apply some.
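
The "split into smaller chunks" idea from the description could be sketched roughly as below. This is only an illustration, not the actual HBase patch: the class name {{BatchSplitter}}, the method {{partition}}, and the {{maxBatchSize}} parameter are hypothetical, and the sketch is generic over the element type so that it runs without ZooKeeper on the classpath. It bounds the number of ops per batch, not their serialized size, so the limit would still need to be chosen conservatively relative to {{jute.maxbuffer}}.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ZKUtil: splits a caller-submitted list of
// operations into consecutive batches of bounded length.
final class BatchSplitter {

  private BatchSplitter() {
  }

  /**
   * Returns the elements of ops grouped into consecutive batches, each holding
   * at most maxBatchSize elements. Order is preserved within and across batches.
   */
  static <T> List<List<T>> partition(List<T> ops, int maxBatchSize) {
    if (maxBatchSize <= 0) {
      throw new IllegalArgumentException("maxBatchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < ops.size(); start += maxBatchSize) {
      int end = Math.min(start + maxBatchSize, ops.size());
      // Copy the view so each batch is independent of the input list.
      batches.add(new ArrayList<>(ops.subList(start, end)));
    }
    return batches;
  }
}
```

A caller would then issue one {{multi()}} per returned batch instead of one for the whole list. Once the list is split this way, the caller's ops no longer travel in a single request, so any all-or-nothing expectation across the full list would not carry over; that is the atomicity concern raised above.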


