Posted to issues@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2020/07/08 18:42:00 UTC
[jira] [Commented] (HBASE-23600) Improve chances of edits landing into hbase:meta even when high load

    [ https://issues.apache.org/jira/browse/HBASE-23600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153849#comment-17153849 ] 

Michael Stack commented on HBASE-23600:
---------------------------------------

Follow-up. I didn't get far w/ this patch; I was unable to see much difference. Needs more work. Perhaps a better route would be putting up a port for metadata only, so Master writes to hbase:meta always land?

> Improve chances of edits landing into hbase:meta even when high load
> --------------------------------------------------------------------
>
>                 Key: HBASE-23600
>                 URL: https://issues.apache.org/jira/browse/HBASE-23600
>             Project: HBase
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Michael Stack
>            Priority: Major
>         Attachments: priority.rpc.patch
>
>
> Of late I've been testing clusters under high load to study failures and to figure out how to effect recovery if a cluster is unable to recover on its own.
> One interesting case is an RS that is struggling mostly because writes to HDFS are backed up and sync calls are taking a long time to complete. The RPC backs up with waiting requests and eventually goes over one or more bounds. The RS then starts throwing CallQueueTooBigExceptions (CQTBEs). This struggling state can last a good while. We throw CQTBEs whatever the priority of the incoming request.
> We throw CQTBE in two places. The first is on the original parse of the request, before we dispatch it to a handler: here we check the size of all queues and, if over the threshold (default 1G), throw the exception. The second is later, when we dispatch the request to internal queues: we count the items in the queue and, if any one queue is over its limit (default is 10 * handler count), we fail the dispatch and again throw CQTBE.
> We shouldn't be running w/ big queues. We should be rejecting requests we know we'll never process in time, before the client loses interest (see the CoDel thesis and the implementations added a good while back; see also the splitting-meta project, so that all requests don't end up on one server). TODO.
> Meantime, I was looking to see what would happen if, having read a high-priority request, I let it through even when above thresholds rather than dropping it on the floor. My main concern is edits to hbase:meta. Under sustained, saturated load on the RS carrying hbase:meta, edits may not land. The result is incomplete Procedures and a disoriented Master. I was experimenting w/ trying to put off the corruption as long as possible (CoDel doesn't do priority at first blush; we probably want to add this).
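
For illustration only, the two rejection points described above, plus the "let high-priority through" experiment, might be sketched roughly as below. This is not the attached patch and not HBase's actual RPC code: the class, field, and method names (AdmissionSketch, Call, admit, letHighPriorityThrough) are all hypothetical, and a single queue stands in for the several internal queues. Only the two thresholds (total bytes, and 10 * handler count items) come from the description.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch; names are illustrative, not HBase's RPC classes. */
class AdmissionSketch {
    /** Stand-in for CallQueueTooBigException. */
    static class QueueTooBigException extends RuntimeException {
        QueueTooBigException(String msg) { super(msg); }
    }

    /** A queued request; highPriority marks e.g. an edit bound for hbase:meta. */
    static final class Call {
        final long sizeBytes;
        final boolean highPriority;
        Call(long sizeBytes, boolean highPriority) {
            this.sizeBytes = sizeBytes;
            this.highPriority = highPriority;
        }
    }

    private final long maxTotalBytes;          // the description cites a 1G default
    private final int maxCallsPerQueue;        // the description cites 10 * handler count
    private final boolean letHighPriorityThrough; // assumed toggle for the experiment
    private final Queue<Call> queue = new ArrayDeque<>(); // one queue, for simplicity
    private long totalBytes = 0;

    AdmissionSketch(long maxTotalBytes, int handlerCount, boolean letHighPriorityThrough) {
        this.maxTotalBytes = maxTotalBytes;
        this.maxCallsPerQueue = 10 * handlerCount;
        this.letHighPriorityThrough = letHighPriorityThrough;
    }

    /** Enqueues the call, or throws QueueTooBigException if a bound is exceeded. */
    void admit(Call call) {
        boolean bypass = letHighPriorityThrough && call.highPriority;

        // Check 1 (at parse time): total bytes across queued calls.
        if (!bypass && totalBytes + call.sizeBytes > maxTotalBytes) {
            throw new QueueTooBigException("total queued bytes over threshold");
        }
        // Check 2 (at dispatch time): number of items in the target queue.
        if (!bypass && queue.size() >= maxCallsPerQueue) {
            throw new QueueTooBigException("queue item count over threshold");
        }
        queue.add(call);
        totalBytes += call.sizeBytes;
    }

    int queued() { return queue.size(); }
}
```

Note that in this sketch nothing bounds the bypassed calls, so a flood of high-priority requests could still grow the queue without limit; any real change would presumably need some ceiling for them too.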



--
This message was sent by Atlassian Jira
(v8.3.4#803005)